05 Oct 2017
In this post we introduce Ex-Ray, our recently developed system. We use it to detect browser extensions which leak browsing history, regardless of their leakage channel. After analyzing Chrome extensions with more than 1,000 installations (10,691 total) we flagged 212 as leaking. We also found two extensions with a large installation base that leak users' history by means that were undetectable to prior work.
Our paper "Ex-Ray: Detection of History-Leaking Browser Extensions" is available for download here: pdf and bib. This project was a collaboration between Northeastern University and University College London. We will present the work at ACSAC this December.
Update: the ACSAC slides are available here.
The browser has become the primary interface for interactions with the Internet, from writing emails, to listening to music, to online banking. The shift of applications from the desktop to the Web has made the browser the de-facto operating system. Browser extensions can "extend" the core functionality of the browser, across all online activities of a user. They sometimes pave the way towards features which later become integrated into browsers themselves, such as password managers.
However, the access to powerful APIs given to extensions also allows for undesired side effects, such as invasion of privacy. This project is partially motivated by our previous analysis into the SimilarWeb browsing history data collection. We found 42 extensions that reported all of users' browsing history to a third party, often without it being required by the advertised functionality or disclosed in the terms of service.
This motivated us to investigate further and develop a more general detection system for privacy leaks in browser extensions. We wanted an approach that captures fundamental invariants of tracking browsing behavior and that would be robust against obfuscation or encryption. Ex-Ray combines two complementary systems, one supervised and one unsupervised, with a triage system that eases manual verification. We flagged 212 extensions as history-leaking and discovered extensions that were leaking in ways that were out of scope for prior work. One extension was using strong encryption on tracking beacons before transfer, and the other one was using WebSockets. As our system works independently of the leakage channel, we were able to flag both.
To gain insight into the environment in which trackers operate, and how data might be used, we configured a honeypot. We exercised extensions in a container and browsed by serving sites locally. Both Web and DNS were configured to work without interacting with the public Internet, except if extensions purposefully did so. We also operated a webserver with the same address on the public Internet that would collect incoming requests. As we encoded the extension ID into the URLs we visited, we were able to link incoming requests to extensions that have leaked them. After excluding VPN and proxy extensions, we found 38 extensions that would connect back to our honeypot. The confirmation that trackers are acting on leaked data motivated further steps in this work. We used these extensions as part of our ground truth for further experiments.
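The linking step can be sketched roughly as follows: each extension browses URLs under its own subdomain, so the Host header of any request that later reaches the public honeypot identifies the extension that leaked it. The domain `honeypot.example`, the function names, and the exact URL layout are assumptions made for illustration and are not taken from our setup.

```python
from typing import Optional

# Hypothetical honeypot domain; any domain under your control would do.
HONEYPOT_DOMAIN = "honeypot.example"


def tagged_url(extension_id: str, path: str = "/") -> str:
    """Build a URL whose subdomain is unique to one extension."""
    return f"http://{extension_id}.{HONEYPOT_DOMAIN}{path}"


def extension_from_host(host: str) -> Optional[str]:
    """Recover the extension ID from the Host header of an incoming request."""
    host = host.split(":")[0].lower()  # drop an optional port
    suffix = "." + HONEYPOT_DOMAIN
    if not host.endswith(suffix):
        return None  # request was not aimed at a tagged subdomain
    label = host[: -len(suffix)]
    return label or None
```

Chrome extension IDs are 32 lowercase letters, so they fit into a single DNS label without further encoding.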
Here we compare extension execution to incoming requests over time. We noticed that leaked history is often used to crawl the sites immediately after it leaks. These connections confirm that leaked browsing history is used by the receivers and is not leaked purely coincidentally. However, we identified no malicious behavior in our log files, such as vulnerability scans.
- Sochabra for Stand Alone [translated]
- 500px image downloader
- EyeEm Image Downloader
- Facebook Image downloader
- Flickr image downloader
- Image Downloader for Facebook & Instagram
- Pinterest Image downloader
- ABC ad blocking China special edition [translated]
- CTRL-ALT-DEL new tab
- Desprotetor de Links
- Pop up blocker for Chrome
- Similar Sites
Other than the behavior over time, another aspect is possible collaboration between extension authors. In our honeypot probe we observed hosts that connected to multiple URLs unique to extensions, and conversely URLs that received connections from multiple hosts. These relations are possible indicators for a form of data sharing or shared infrastructure between trackers. Each line in this table consists of such a connected group.
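One way to derive such groups is to treat hosts and extension-specific URLs as nodes of a graph and compute its connected components. The sketch below does this with a small union-find structure. The input format (pairs of connecting host and visited URL) and the function name are assumptions for illustration.

```python
from collections import defaultdict


def connected_groups(connections):
    """Group hosts and extension URLs that are linked by observed connections.

    connections: iterable of (host, extension_url) pairs from the honeypot log.
    Returns a list of sets of ("host", name) / ("url", name) nodes.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for host, url in connections:
        # Merge the component of the host with the component of the URL.
        parent[find(("host", host))] = find(("url", url))

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)

    # A lone host/URL pair has two nodes; anything larger indicates sharing.
    return [group for group in groups.values() if len(group) > 2]
```

For example, a host that requested the URLs of two different extensions ends up in one group together with both URLs, while a host that only ever touched a single URL is filtered out.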
Our system has three main components.
- Unsupervised learning: based on counterfactual analysis on network traffic over multiple executions, we detect history-stealing extensions.
- Triage-based analysis: A scoring system that can highlight extensions which have suspicious traffic behavior. It can be used as a pre-processing step to manually vet extensions.
- Supervised learning: Using a labeled dataset from previous experiments, we can systematize identification of suspicious extensions. We build a model that detects leaks based on API calls.
In this post we will focus on the unsupervised learning component. For the other components we refer to the paper.
Comparison of sent traffic over several execution stages with increasing amount of history. On the left we see history-leaking extensions, and on the right benign ones. Data that is sent out by extensions varies little for benign extensions, but for trackers it will vary depending on the amount of history supplied.
To identify privacy-violating extensions, we exercise them in multiple stages, changing the amount of private data supplied to the browser, and in turn to the extension under test. Based on the type of extension, the traffic usage can change depending on the number of visited sites. However, the underlying assumption is that benign extension traffic should not be influenced by the size of the browsing history.
Based on this insight, we use linear regression on each set of flows to estimate the optimal set of parameters that support the identification of history-leaking extensions. We aim to establish a causality relation between two variables: (i) the amount of raw data sent through the network and (ii) the amount of history leaked to a given domain. For this, we rely on the counterfactual analysis model. We use the size of history we provide to an extension as the input variable to a controlled environment. Next, we observe outgoing traffic as the output variable for our classification. We also use other indicators, such as the lower bound of compressed history, as cut-off values. Our detection engine is described in detail in the full paper (see links at top and bottom of post).
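To make the idea concrete, here is a minimal sketch, not the detection engine from the paper: it fits a least-squares line through (history size, bytes sent) pairs for one set of flows and checks whether the slope exceeds a cut-off. The function names, the `min_slope` parameter, and the sample numbers are illustrative assumptions. In particular, `min_slope` only loosely stands in for the compressed-history lower bound mentioned above.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def looks_history_dependent(history_bytes, sent_bytes, min_slope):
    """True if outgoing traffic grows with the amount of history supplied."""
    slope, _ = fit_line(history_bytes, sent_bytes)
    return slope >= min_slope


# Made-up measurements over four execution stages (bytes).
history = [1000, 2000, 4000, 8000]
tracker = [1500, 2500, 4500, 8500]  # grows with the history size
benign = [300, 310, 295, 305]       # roughly constant
```

With these made-up numbers, `looks_history_dependent(history, tracker, 0.2)` holds while the same call for `benign` does not. A real classifier would run this per destination domain and account for noise across repeated executions.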
Ex-Ray extension execution overview. After downloading extensions from the Chrome Web Store, we exercise them in containers to collect traces for classification. To support our honeypot experiment we only access Web and DNS locally. As the subdomains we use are unique per extension and we keep the connections local to a container, leaks can be linked to the extension under test.
In total, Ex-Ray flagged 212 Chrome extensions as history-leaking. This included two extensions which were undetectable to prior work. Web of Trust uses strong encryption (RC4) at the extension level before transferring data via HTTPS. Coupon Mate is an extension that leaks browsing history via WebSockets, which are used by 0.96% of the extensions that we analyzed. Prior work uses keyword analysis on particular protocols, which would not have triggered on these two extensions.
Our dataset of flagged extensions and a triage report are available in our repository.
The number of extensions leaking history is troublesome, in particular as this is possible for extensions with only modest permission access. While tracking on websites is prevalent, websites have to opt in to it, and solutions exist that allow users to remove trackers (e.g., Ghostery). Conversely, tracking in browser extensions covers all visited websites and no opt-out mechanism exists. Extension stores do not seem to monitor for this behavior.
Our key takeaways from this project are as follows:
- It is easy for a browser extension to monitor and report browsing to a third party without requesting suspicious permissions.
- Extensions utilize leaking channels that have not been considered by state-of-the-art leak detection before.
- Leaking behavior can be detected in a robust way with a combination of supervised and unsupervised methods, for example with a system such as Ex-Ray.
- Extension stores should monitor for such behavior and alert users of history leaks.
- As a general precaution, users should be careful when installing browser extensions, as stores do not monitor for such behavior currently.
We introduce a new method for detection of privacy-violating browser extensions, independently of their protocol, and developed a prototype system: Ex-Ray. Our system uses a combination of supervised and unsupervised methods to identify features characteristic of leaking extensions. We analyzed all extensions from the Chrome Web store with more than 1,000 installations (10,691 total) and flagged 212 extensions as history-leaking. Two extensions that we flagged were leaking history in previously undetectable ways. We suggest that extensions should both be tested more rigorously when admitted to the store and be monitored while they execute within browsers. Our paper is available for download here: (pdf and bib).
31 Mar 2016
This post investigates the upalytics.com library for Chrome extensions performing real time tracking of users on all sites they visit. The code is bundled with plenty of "free" extensions, exfiltrating browsing history as a feature. Such software is commonly known as spyware. Within the top 7,000 extensions of the Chrome Web store, the library is used 42 times with over 8 million installs. The post also looks into the relationship of upalytics with similarweb.com. The compiled data is also available in this spreadsheet.
Update: We published a paper about a system to automatically find such extensions.
I came across a website that offered browsing insights for websites they have no clear relation to, similarweb. The data includes links clicked on a site, referrer statistics, the origin of users, and others. While this is interesting, it also raises a question - where is that data coming from? Based on their website they collect data from millions of devices, but the software they advertise was orders of magnitude away from that. Data had to come from somewhere else.
Bundling unwanted content with "free" software is an unfortunate reality which has been shown before. This quickly became my working theory. Tracking browsing behavior alone is nothing new, but I was surprised by how widespread this library turned out to be.
I started with the similarweb Chrome Extension, which is where I first came across the upalytics library. By doing some code reading I noticed it was tracking browsing habits and reporting them in real time. Next I started looking for similarities between this extension and the 7,000 most popular ones offered in the Chrome Web store.
Step one was an educated grep - looking for the "upalytics" string, which led to the first hits. What these libraries had in common is the string "SIMPLE_LB_URL" when accessing the backend API. Searching for that led to more results, as not all libraries contain the "upalytics" string.
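A search like this can be reproduced with a few lines of Python. The sketch below assumes the extensions have already been downloaded and unpacked into one directory each under a common root. That layout and the function name are assumptions. The two marker strings are the ones mentioned above.

```python
from pathlib import Path

# Marker strings mentioned in the post.
MARKERS = ("upalytics", "SIMPLE_LB_URL")


def find_marked_extensions(root):
    """Return {extension directory name: set of markers found in its .js files}."""
    hits = {}
    for ext_dir in sorted(Path(root).iterdir()):
        if not ext_dir.is_dir():
            continue
        found = set()
        for js_file in ext_dir.rglob("*.js"):
            # Ignore undecodable bytes; minified bundles are not always clean UTF-8.
            text = js_file.read_text(encoding="utf-8", errors="ignore")
            found.update(marker for marker in MARKERS if marker in text)
        if found:
            hits[ext_dir.name] = found
    return hits
```

A hit only means the string is present. As noted further down, containing the code does not by itself prove that an extension exfiltrates data, so each hit still needs manual testing.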
To evaluate these extensions I wanted to know:
- Does installing the extension exfiltrate data?
- Does tracking happen out of the box, or does the user have to opt in?
- Is this mentioned in the terms of service?
- If not, is there at least a link in the terms of service that explains what is happening?
I changed the endpoint address in each extension to point towards my server and evaluated each extension.
I found 42 extensions which used the library totaling 8M installs. Note: "Facebook Video Downloader" (1,000 installs) required updating of the manifest to install.
Containing the code alone does not imply an extension exfiltrates data. But manual testing confirmed: every single one was tracking browsing behavior. With every requested site, the extensions will send another POST request in the background to announce the action. What is particularly problematic is that some of these extensions pretend to be security relevant, including phishing protection or content filters.
Out of these 42 extensions, 23 did not mention data collection in their terms; of those 23, 12 also have no URL where this would be explained. One URL that is used across 12 extensions to explain the privacy ramifications is http://addons-privacy.com. The only extension offering opt-in to tracking is "SpeakIt!". They had an issue opened here where someone pointed this out as spyware before introduction of the opt-in step.
All data is compiled into a spreadsheet, available here.
Do it - a Shia LaBeouf motivator: In exchange for browsing history users can get motivated by Shia. The extension offers a button that will make him pop up and shout a motivational quote. 200 thousand users considered this a good deal, who am I to judge? :-)
Video AdBlock for Chrome - this extension is advertised as "ADWARE FREE We are not injecting any third-party ads!". Technically this might be correct. Are spyware and adware the same?
Taking a peek
To see what is transmitted I modified the phishing extension (and all others) to post data to my local server instead of theirs. This was fairly simple - I set up a Python Flask application that accepts POST requests to /related and GET requests to /settings. The POST data is base64 encoded - twice. Why twice? I don't know. Below is the data the server side sees while the client is browsing. Line breaks inserted to help readability.
# We go to bing, after previously visiting asdf.com:
# We send a query "this is a test":
# We click a link on the bing results:
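For readers who want to reproduce this, a receiver along these lines is enough. This is a standard-library stand-in for the Flask application described above, not the original code. It assumes the entire POST body is the doubly base64-encoded payload, which may not match the real wire format (the data could be wrapped in a form field). The port and the empty /settings reply are placeholders as well.

```python
import base64
from http.server import BaseHTTPRequestHandler, HTTPServer


def decode_twice(payload: bytes) -> str:
    """Undo two rounds of base64 encoding and return the text."""
    once = base64.b64decode(payload)
    return base64.b64decode(once).decode("utf-8", errors="replace")


class Receiver(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path.startswith("/related"):
            length = int(self.headers.get("Content-Length", 0))
            # Assumption: the raw body is the double-encoded payload.
            print(decode_twice(self.rfile.read(length)))
        self._reply(b"")

    def do_GET(self):
        # The library requests /settings; an empty JSON object is a placeholder.
        self._reply(b"{}")

    def _reply(self, body: bytes):
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080):
    """Listen on localhost only; call this manually to start the receiver."""
    HTTPServer(("127.0.0.1", port), Receiver).serve_forever()
```

After pointing an extension's endpoint at `http://127.0.0.1:8080`, calling `serve()` prints one decoded record per page visit.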
What data will be transmitted?
- Every visited website
- Search queries (Google, Bing, etc. )
- Websites visited on internal networks
As far as I can tell this will not be transmitted:
- POST data (e.g.: passwords, usually)
The network view
The endpoints that receive the data use a variety of domain names with multiple IPs. These 42 extensions use nine distinct domains: eight use the same subdomain pattern (lb.domain.com), and one is a subdomain of upalytics.com. I suspect an attempt to distract from the impression that all data flows to one company. The domain names include ones that are supposed to look benign, such as connectupdate.com, secureweb24.net, and searchelper.com. The other domains involved are: crdui.com, datarating.com, similarsites.com, thetrafficstat.net, webovernet.com.
All these domains are registered with domainsbyproxy, a service used to obscure the ownership of domain names. This includes upalytics.com itself, which is used in one of the extensions (SpeakIt!). Also, the robots.txt file used in all cases is the same.
What's more interesting: All these IPs belong to the same hoster, XLHost.com. Eight out of nine of these hosts have all addresses in a /18 network, while half of the IPs of the upalytics.com endpoint are in another xlhost network. For browsing convenience (or your firewall?) the list of IPs is available here. All IPs in use are unique; however, they include consecutive IP addresses and other neighborhood relationships.
To examine this closer I compared the distance of IP addresses used by these extensions for tracking. In the graph below, the nodes are the nine domain names in use, edges are amount x distance. By taking into account distances of up to four, we can link together all hostnames used in all 42 extensions. For example: IPs "18.104.22.168" and "22.214.171.124" have a distance of 2. As for the labels, the edge between "similarsites.com" and "thetrafficstat.net" reads "6x2". This means that the domains share 6 IP addresses with a distance of 2. Before the graph, this is the relationship between lb.crdui.com and lb.datarating.com:
Combining all hosts into one graph, we get this:
What does this imply? Whether this is one large data kraken or pure coincidence, I will leave for the reader to decide.
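The IP comparison behind these graphs can be approximated as follows. This sketch assumes that "distance" means the absolute difference between two IPv4 addresses read as integers, and that an edge label such as "6x2" counts address pairs at that distance. Both are interpretations of the description above rather than the original script, and the addresses in the usage note come from the documentation range, not from the real list.

```python
import ipaddress
from collections import Counter


def ip_distance(a: str, b: str) -> int:
    """Absolute numeric difference between two IP addresses."""
    return abs(int(ipaddress.ip_address(a)) - int(ipaddress.ip_address(b)))


def edge_labels(ips_a, ips_b, max_distance=4):
    """Count address pairs per distance between the IPs of two domains.

    A result of {2: 6} would correspond to the edge label "6x2".
    """
    counts = Counter()
    for a in ips_a:
        for b in ips_b:
            distance = ip_distance(a, b)
            if 0 < distance <= max_distance:
                counts[distance] += 1
    return dict(counts)
```

For instance, `edge_labels(["192.0.2.10", "192.0.2.20"], ["192.0.2.12", "192.0.2.22"])` yields two pairs at distance 2, which would be drawn as an edge labeled "2x2".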
Is this malware, an unwanted feature, or totally OK?
Some of these extensions have terms that mention privacy, here is an example:
Calling the data "anonymous" seems bold: an IP alone can often be used to uniquely identify users, let alone browsing history. Based on this text the majority of users might not be aware of the extent of monitoring. I was surprised myself by the boldness of the tracking. However, even if this was laid out clearly in the terms, common sense dictates that browser extensions have no business recording unrelated traffic.
That being said, this behavior could be in violation of the Extension Quality Guidelines, in particular the "single purpose" rule. Whether this is the case, I cannot judge.
This post looks into usage of this one library in the Chrome Extensions in the Chrome Web store alone. The number of extensions I found is to be considered a lower bound; there could well be more. For the extensions I examined I did not check other libraries that were loaded, nor did I check for behavior other than tracking browsing history. Upalytics also offers libraries for other platforms (Smartphones, Desktop, other browsers) - I did not take a look at these either.
This is just one library for one platform. Upalytics supports all major smartphone platforms and browsers, as well as Microsoft and Mac platforms. Also, there are more players in the game than this one.
I'm afraid to say that even if all these extensions get nuked from the store, there might be plenty of similar libraries in other extensions.
04/01/16: None of these extensions are accessible in Google Web store at this point.
03/31/16: I expanded on the explanation of the IP relationships.
10/05/17: We published a paper to detect such leaks automatically. See here for details.