These Chrome extensions spy on 8 million users

Overview


This post investigates the upalytics.com library for Chrome extensions performing real time tracking of users on all sites they visit. The code is bundled with plenty of "free" extensions, exfiltrating browsing history as a feature. Such software is commonly known as spyware. Within the top 7,000 extensions of the Chrome Web store, the library is used 42 times with over 8 million installs. The post also looks into the relationship of upalytics with similarweb.com. The compiled data is also available in this spreadsheet.

Update: We published a paper about a system to automatically find such extensions.

Intro


I came across a website that offered browsing insights for websites they have no clear relation to, similarweb. The data includes links clicked on a site, referrer statistics, the origin of users, and others. While this is interesting, it also raises a question - where is that data coming from? Based on their website they collect data from millions of devices, but the software they advertise was orders of magnitude away from that. Data had to come from somewhere else.

Bundling unwanted content with "free" software is an unfortunate reality which has been shown before. This quickly became my working theory. Tracking browsing behavior alone is nothing new, but I was surprised by how widespread this library turned out to be.

Methodology


I started with the similarweb Chrome Extension, this is where I first came across the upalytics library. By doing some code reading I noticed it was tracking browsing habits and reporting it in real time. Next I started looking for similarities between this extension and the 7,000 most popular ones offered in the Chrome Web store.

Step one was an educated grep - looking for the "upalytics" string, which led to the first hits. What these libraries had in common is the string "SIMPLE_LB_URL" when accessing the backend API. Searching for that lead to more results, not all libraries contain the "upalytics" string.

To evaluate these extensions I wanted to know:

I changed the endpoint address in each extension to point towards my server and evaluated each extension.

Results


I found 42 extensions which used the library totaling 8M installs. Note: "Facebook Video Downloader" (1,000 installs) required updating of the manifest to install.

Containing the code alone does not imply an extension exfiltrates data. But, manual testing confirmed: every single one was tracking browsing behavior. With every requested site, the extensions will send another POST request in the background to announce the action. What is particularly problematic is that some of these extensions pretend to be security relevant. Including phishing protection or content filters.

Out of these 42 extensions 23 did not mention data collection in their terms, out of these 12 further have no URL where this would be explained. One URL that is used across 12 extensions to explain the privacy ramifications is http://addons-privacy.com. The only extension offering opt-in to tracking is "SpeakIt!". They had an issue opened here where someone pointed this out as spyware before introduction of the opt-in step.

All data is compiled into a spreadsheet, available here.

Noteworthy examples


Do it - a Shia LaBeouf motivator: In exchange for browsing history users can get motivated by Shia. The extension offers a button that will make him pop up and shout a motivational quote. 200 thousand users considered this a good deal, who am I to judge? :-)

Video AdBlock for Chrome - this extension is advertised as "ADWARE FREE We are not injecting any third-party ads!". Technically this might be correct. Is spyware and adware the same?

Taking a peek


To see what is transmitted I modified the phishing extension (and all others) to post data to my local server instead of theirs. This was fairly simple - I set up a python Flask application that accepts POST requests to /related and GET requests to /settings. The POST data is base64 encoded - twice. Why twice? I don't know. Below is the data the server-side sees while the client is browsing. Line breaks inserted to help readability.

# We go to bing, after previously visiting asdf.com:

s=714&md=21&pid=gvOq01lLa3ZBt6z&sess=475474837468937000&q=http://www.bing.com/
&prev=http://asdf.com/&link=0&sub=chrome&hreferer=&tmv=3015


# We send a query "this is a test":

s=714&md=21&pid=gvOq01lLa3ZBt6z&sess=475474837468937000&q=http://www.bing.com/search?
q=this+is+a+test&go=Submit&qs=n&form=QBLH&pq=this+is+a+test&sc=8-14&sp=-1&sk=&
cvid=456B43655F44452BB33CC9AE204294B3&prev=http://www.bing.com/&link=1&
sub=chrome&hreferer=http://www.bing.com/&tmv=3015


# We click a link on the bing results:

s=714&md=21&pid=gvOq01lLa3ZBt6z&sess=475474837468937000&q=https://en.wikipedia.org/wiki/This_Is_Not_a_Test!&
prev=http://www.bing.com/search?q=this+is+a+test&go=Submit&qs=n&form=QBLH&pq=this+is+a+test&sc=8-14
&sp=-1&sk=&cvid=456B43655F44452BB33CC9AE204294B3&link=1&sub=chrome&hreferer=http://www.bing.com/search?q=this+is+a+test
&go=Submit&qs=n&form=QBLH&pq=this+is+a+test&sc=8-14&sp=-1&sk=&cvid=456B43655F44452BB33CC9AE204294B3&tmv=3015

What data will be transmitted?

As far as I can tell this will not be transmitted:

The network view


The endpoints that receive the data use a variety of domain names with multiple IPs. These 42 extension use nine distinct domains, eight of which use the same subdomain (lb.domain.com), one is a subdomain of upalytics.com. I suspect an attempt to distract from the impression that all data flows to one company. The domain names include ones that are supposed to look benign, connectupdate.com, secureweb24.net, searchelper.com. The other domains involved are: crdui.com, datarating.com, similarsites.com, thetrafficstat.net, webovernet.com.

All these domains are registered with domainsbyproxy, a service used to obscure the ownership of domain names. This includes upalytics.com itself which is used in one of the extensions (Speakit!). Also, the robots.txt file used in all cases is the same.

What's more interesting: All these IPs belong to the same hoster, XLHost.com. Eight out of nine of these hosts have all addresses in a /18 network, half of the IPs of the upalytics.com endpoint are in another xlhost network. For browsing convenience (or your firewall?) the list of IPs is available here. All IPs in use are unique, however, this involves consecutive IP addresses and other neighborhood relationships.

To examine this closer I compared the distance of IP addresses used by these extensions for tracking. In the graph below, the nodes are the nine domain names in use, edges are amount x distance. By taking into account distances of up to four, we can link together all hostnames used in all 42 extensions. For example: IPs "1.1.1.1" and "1.1.1.3" have a distance of 2. As for the labels, the edge between "similarsites.com" and "thetrafficstat.net" reads "6x2". This means that the domains share 6 IP addresses with a distance of 2. Before the graph, this is the relationship between lb.crdui.com and lb.datarating.com:

IP distance

Combining all hosts into one graph, we get this:

connection graph

What does this imply? Whether this is one large data kraken or pure coincidence, I will leave for the reader to decide.

Is this malware, an unwanted feature, or totally OK?


Some of these extensions have terms that mention privacy, here is an example:

We consider that the global measuring and ranking of the Internet in the current market is somewhat underdeveloped and obscure. For this reason, we have undertaken a large global project which bring a powerful improvement in the public’s perception of internet trends and expand the overall comprehension of the dynamics that are happening on the internet on daily basis. In order to make this goal a reality, we need anonymous data such as browsing patterns, statistics and information on how our features are being used. When installing one of our free products, you will expect to become a proud part of this project and make this change happen together with us. If you want more details on the interaction that will be going on between your browser and our servers, feel free to check out our Privacy Policy. By installing our product you adhere to the Terms and Conditions as well as Privacy Policy adhered on: http://crxmousetou.com/

Calling the data "anonymous" seems bold, an IP alone can often be used to uniquely identify users, let alone browsing history. Based on this text the majority of users might not be aware of the extent of monitoring. I was surprised myself by the boldness of the tracking. However, even if this was laid out clearly in the terms, common sense dictates that browser extensions have no business recording unrelated traffic.

That being said, this behavior could be in violation of the Extension Quality Guidelines, in particular the "single purpose" rule. Whether this is the case, I can not judge.

Limitations


This post looks into usage of this one library in the Chrome Extensions in the Chrome Web store alone. The number of extensions I found is to be considered as a lower bound, there could be well more. For the extensions I examined I did not check other libraries that were loaded or checked for behavior other than tracking browsing history. Upalytics also offers libraries for other platforms (Smartphones, Desktop, other browsers) - I did not take a look at these either.

Closing


This is just one library for one platform. Uplaytics supports all major smartphones, browsers but also Microsoft and Mac platforms. Also, there are more players in the game than this one.

I'm afraid to say that even if all these extensions get nuked from the store, there might be plenty similar libraries in other extensions.

Updates


04/01/16: None of these extensions are accessible in Google Web store at this point.
03/31/16: I expanded on the explanation of the IP relationships.
10/05/17: We published a paper to detect such leaks automatically. See here for details.