HotFuzz is a project in which we search for Algorithmic Complexity (AC) vulnerabilities in Java programs. These are vulnerabilities where a small input can slow a program down enough to exhaust its resources. To use an analogy: it is easy to overwhelm a pizza place by ordering 100 pizzas at once, but can you think of a single pizza that would grind the restaurant's gears and bring operations to a halt?
We created a specialized fuzzer that runs on a customized JVM, collecting resource measurements to guide input generation towards worst-case behavior. We found 132 AC vulnerabilities in 47 of the top 100 Maven libraries and 26 AC vulnerabilities in the JRE, including one in java.math that was reported and fixed as CVE-2018-1517.
The effects of exploiting an AC vulnerability can resemble a Denial of Service attack. However, DoS attacks are often linear in behavior: the attacker sends many requests that together overwhelm the target system. For this project we are interested in attacks where input and effect are disproportionate, for example making a single request that leaves a program stuck in a loop, exhausting its CPU. There is some related work, such as SlowFuzz and PerfFuzz, but we decided this topic needed further exploration.
System
As opposed to fuzzing a program's main entry point, we fuzz all methods, treating each one as a potential entry point. To generate input values for method parameters we use Small Recursive Instantiation (SRI) based on the type signature: we recursively instantiate objects that can be passed into the method invocation, and for primitive variables we generate values randomly. Since uniform distributions are not ideal for Java objects, we use a custom distribution which is described in the paper. These values form the initial population for our genetic algorithm, which uses CPU utilization as the fitness function driving inputs towards worst-case consumption. This process emulates natural selection: entities in a population are paired and produce cross-over offspring with some mutations. Typically, fuzzers perform cross-over on the bit level; extending this technique to objects with type hierarchies is a central component of micro-fuzzing.
As a baseline to compare against we used another instantiation approach called Identity Value Instantiation (IVI), which picks the identity value for each type: 0 for integers, the empty string for strings, and so on. IVI is the simplest possible value selection strategy and is used merely to assess the effectiveness of SRI. The sketch below contrasts the two strategies and the genetic loop built on top of them.
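This is a minimal Python sketch only: the real system operates on Java objects via reflection and measures CPU consumption inside EyeVM, so the types, distributions, and parameters here are illustrative assumptions rather than HotFuzz's actual implementation.

```python
import random
import time

def ivi(param_type):
    # Identity Value Instantiation: always the identity element of the type.
    return {"int": 0, "str": "", "list": []}[param_type]

def sri(param_type, depth=0):
    # Small Recursive Instantiation: random primitives, bounded recursion
    # for compound types (HotFuzz does this over Java type signatures).
    if param_type == "int":
        return random.randint(-1000, 1000)
    if param_type == "str":
        return "".join(chr(random.randint(32, 126))
                       for _ in range(random.randint(0, 16)))
    if param_type == "list" and depth < 3:
        return [sri(random.choice(["int", "str", "list"]), depth + 1)
                for _ in range(random.randint(0, 4))]
    return []

def fitness(method, args):
    # Stand-in for EyeVM's precise per-invocation CPU measurement.
    start = time.process_time()
    try:
        method(*args)
    except Exception:
        pass
    return time.process_time() - start

def crossover(a, b):
    # Cross-over on structured inputs: swap whole typed arguments
    # instead of flipping bits, the core idea of micro-fuzzing.
    return [x if random.random() < 0.5 else y for x, y in zip(a, b)]

def micro_fuzz(method, param_types, generations=50, pop_size=20):
    population = [[sri(t) for t in param_types] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda a: fitness(method, a),
                        reverse=True)
        survivors = ranked[:pop_size // 2]
        children = [crossover(random.choice(survivors),
                              random.choice(survivors))
                    for _ in range(pop_size - len(survivors))]
        for child in children:  # mutation: re-instantiate one argument
            if child and random.random() < 0.2:
                i = random.randrange(len(child))
                child[i] = sri(param_types[i])
        population = survivors + children
    return max(population, key=lambda a: fitness(method, a))
```

Run against a function with a quadratic worst case, this loop would, with luck, evolve inputs towards that worst case; HotFuzz does the same for every method in a JAR.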
For measurement, micro-fuzzing workers run a custom JVM called EyeVM. EyeVM is built on top of OpenJDK, with the HotSpot VM modified to measure CPU consumption precisely.
Output from micro-fuzzing is passed on to witness synthesis and validation, where a generated program is run and measured on an unmodified JVM. This verification step reduces false positives in which fuzzing merely causes a program to hang, for example while polling a socket or file. The synthesized program calls the target method with help from the reflection API and the Google GSON library, using wall-clock time to measure runtime. A sketch of this step follows below.
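Reduced to its essence, validation re-runs each candidate on a stock runtime and keeps it only if it is still slow. The real harness is a synthesized Java program that reconstructs arguments from JSON via GSON and invokes the target through reflection; the command, threshold, and timeout below are assumptions for illustration.

```python
import subprocess
import time

def validate_witness(witness_cmd, threshold_seconds=60.0, timeout=300):
    # Re-run the synthesized witness on an unmodified JVM and measure
    # wall-clock runtime; only persistently slow candidates survive.
    start = time.monotonic()
    try:
        subprocess.run(witness_cmd, timeout=timeout, check=False)
        elapsed = time.monotonic() - start
    except subprocess.TimeoutExpired:
        elapsed = float(timeout)  # still running when we gave up
    return elapsed >= threshold_seconds

# Hypothetical invocation; the witness JAR name is made up.
if validate_witness(["java", "-jar", "witness.jar"]):
    print("confirmed AC witness")
```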
HotFuzz system overview leading from jar file to AC witnesses
Results
We micro-fuzzed the JRE, the top 100 Maven libraries, and challenges provided to us by DARPA. We found 132 AC vulnerabilities in 47 of the Maven libraries and 26 AC vulnerabilities in the JRE, including one in java.math.
One finding resulted in a CVE that was fixed by Oracle and IBM in their JRE implementations: CVE-2018-1517. HotFuzz found an input to BigDecimal.add, part of Java's arbitrary-precision arithmetic library, that dramatically slows down processing. In the case of IBM J9 we measured that the process would be stuck for months. The cause of the slowdown lies in how variables are allocated in the library. The bug and more results are presented in detail in the paper.
Beyond the comparison of SRI with IVI, one negative result is not covered in the paper: we also tried seeding fuzzing input with data from the libraries' unit tests. To our surprise, this did not improve fuzzing results.
Summary
HotFuzz is a method-based fuzzer for Java programs that tries to drive input towards worst-case behavior with regard to CPU consumption. We found bugs in popular Maven libraries and one CVE in java.math that led to fixes by Oracle and IBM. The paper will be published at NDSS this year.
The pie chart is a very successful way of displaying proportions. However, it was never properly named by its inventor, William Playfair.
It has different names in different languages, and apparently these are often based on food: pie in English, cake in German, pizza in Portuguese, and Camembert in French. This sparked my interest in what else is out there. To find out, I made a short Google Cloud program to translate "pie chart" to 103 different languages and back. I found that there are odder cases than Camembert, such as "cookies" in Hmong, "skin" in Samoan, or "chart chart" in Kazakh. These results come from Google Cloud Translate only and are not verified by native speakers.
Depiction of a Camembert. Or a pie. What's the difference really? File from Wikipedia: https://en.wikipedia.org/wiki/Camembert
Introduction
Pie charts are roughly 200 years old and a very popular form of displaying proportions. Yet they were never properly named when they were introduced.
Many of the names used for this chart around the world are based on food. For example, in German "pie chart" translates to "Tortendiagramm", which literally translated would be a "cake diagram". What prompted this investigation was this tweet. This post digs a bit into the phenomenon, but I was curious whether I could go further. I thought there must be more to pie charts around the world.
Translations
As the year is 2019, all such investigations must involve the cloud. Therefore I used the Google Cloud Translate API to explore the depths of pie charts in 103 different languages.
In a first step I checked how the German and French cases translate back to English. Theoretically, a flawless translation would go from "pie chart" to ? to "pie chart". This is called round-trip translation and is considered a controversial quality metric for automated translation.
However, a literal translation can lose the original meaning and expose different ones. This is due to limited context: the translation engine makes a best effort at picking from multiple possibilities. While the single word "camembert" translates back simply as "camembert", with more context the engine does better: "draw me a pie chart" translates to French as "me dessiner un camembert" and round-trips back identically. For the purpose of this project, though, these context-free literal translations are exactly what we want.
To verify this briefly I translated the word to French and German, and back. Google Cloud Translation returns "Cake chart" and "Camembert". (The code for this is in the repository, function "de_fr_only".) As this worked out, I went on to translate "pie chart" from English to all 103 available languages, and back. The code is available here.
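In sketch form, the round trip looks like this, using the google-cloud-translate v2 Python client (credential setup omitted; this is a simplified version, not necessarily identical to the repository code):

```python
from google.cloud import translate_v2 as translate

client = translate.Client()

def round_trip(phrase, lang):
    # English -> target language -> English
    there = client.translate(phrase, source_language="en",
                             target_language=lang)["translatedText"]
    back = client.translate(there, source_language=lang,
                            target_language="en")["translatedText"]
    return there, back

# The two hand-checked cases first...
for lang in ("de", "fr"):
    print(lang, *round_trip("pie chart", lang))

# ...then every language the API supports besides English.
for entry in client.get_languages():
    if entry["language"] != "en":
        print(entry["language"], *round_trip("pie chart", entry["language"]))
```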
A note on limitations: I only receive one result per translation; a language could have several expressions for pie charts, but I will only see one. Furthermore, the translations might simply be imperfect.
Results
The Google Cloud Translation API supports 103 languages other than English. I wrote this program to translate "pie chart" to all of these languages and back.
The results are in the file all_the_pies.json, a JSON dictionary with two keys: "pie_to_foreign" and "foreign_to_pie". Each element contains a "from" and a "to" entry with the language indicator and the expression. This data structure is more verbose than required; a dictionary indexed by two-letter language code would have been sufficient for each. I did it this way as I wasn't sure what I would want to do with the data later, and 103 languages are not all that much in the grand scheme of things.
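Reading the data back takes a few lines; the field access below follows the description above but is an illustrative sketch, so check the file itself for the exact schema:

```python
import json

with open("all_the_pies.json") as f:
    pies = json.load(f)

# Each element carries a "from" and a "to" side with the language
# indicator and the expression (field layout per the description above).
for entry in pies["foreign_to_pie"]:
    print(entry["from"], "->", entry["to"])
```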
Looking at the data, 74 of the 103 translations are pie based; however, only 67 are exactly "pie chart". Greek doubles the effort with "pie pie", Bengali has a "pie image", and Portuguese hedges its bets with "pizza pie". Furthermore, Haitian Creole has "board pie", Corsican "graphic pie", and most on-point is Afrikaans with simply "pie". Four of the 103 are cake based: two "cake diagram", one "cake chart", and one "cake table". I guess we could say that the pie chart takes the cake!
Setting aside cakes, pies, and the previously discussed examples, these 24 differ from all the others:
card - Estonian
cart shape - Yoruba
chart chart - Kazakh
circular chart - Vietnamese
circular diagram - Bulgarian, Persian, Slovenian
circular graph - Croatian, Galician
diagram - Basque
drawing - Hausa
glassy table - Tajik
Graph of proportions - Romanian
Hidden chart - Pashto
organizer - Igbo
papa map - Maori
pastry photo - Azerbaijani
Pirate Chat - Sinhala
Point plan - Luxembourgish
skin - Samoan
table - Somali
table of the patient - Swahili
tablet - Hawaiian
the cookies - Hmong
Notably, only "pastry photo" and "the cookies" are food based; all the others are unrelated to food. That is, 22 of the 103 languages supported by Google Translate use an analogy not related to food for their expression for pie charts. Running this set me back 12 cents in cloud compute cost, but I would argue this insight was worth every penny!
Closing Thoughts
The majority of languages seem to relate pie charts to food, and within that mostly to pies. There are several notable exceptions that might seem obscure to English speakers. It remains an open question whether a proper name given by Playfair to his creation 200 years ago would have led to a less diverse naming situation for this chart. Maybe a good takeaway is to name inventions and systems straight away, as opposed to letting others name them.
As some might notice, this post lacks any actual pie charts. If you have been reading this far hoping to see a pie chart, you have been tricked!
In this post we introduce Ex-Ray, our recently developed system. We use it to detect browser extensions that leak browsing history, regardless of their leakage channel. After analyzing the Chrome extensions with more than 1,000 installations (10,691 total) we flagged 212 as leaking. We also found two extensions with large installation bases that leak users' history by means that were undetectable by prior work.
Our paper "Ex-Ray: Detection of History-Leaking Browser Extensions" is available for download here: pdf and bib. This project was a collaboration between Northeastern University and University College London. We will present the work at ACSAC this December.
The browser has become the primary interface for interactions with the Internet, from writing emails, to listening to music, to online banking. The shift of applications from the desktop to the Web has made the browser the de-facto operating system. Browser extensions can "extend" the core functionality of the browser, across all online activities of a user. They sometimes pave the way towards features which later become integrated into browsers themselves, such as password managers.
However, the access to powerful APIs given to extensions also allows for undesired side effects, such as invasion of privacy. This project is partially motivated by our previous analysis of the SimilarWeb browsing history data collection, where we found 42 extensions that reported all of a user's browsing history to a third party, often without this being required by the advertised functionality or disclosed in the terms of service.
This motivated us to investigate further and develop a more general detection system for privacy leaks in browser extensions. We wanted an approach that captures fundamental invariants of tracking browsing behavior and is robust against obfuscation or encryption. Ex-Ray operates with two complementary systems, one supervised and one unsupervised, plus a triage system that eases manual verification. We flagged 212 extensions as history-leaking and discovered extensions that were leaking in ways that were out of scope for prior work: one used strong encryption on tracking beacons before transfer, the other used WebSockets. As our system works independently of the leakage channel, we were able to flag both.
Honeypot Probe
To gain insight into the environment in which trackers operate, and how leaked data might be used, we set up a honeypot. We exercised extensions in a container and browsed by serving sites locally. Both Web and DNS were configured to work without touching the public Internet, except where extensions purposefully did so. We also operated a webserver with the same address on the public Internet that collected incoming requests. As we encoded the extension ID into the URLs we visited, we could link incoming requests to the extensions that had leaked them. After excluding VPN and proxy extensions, we found 38 extensions that connected back to our honeypot. The confirmation that trackers act on leaked data motivated the further steps in this work, and we used these extensions as part of our ground truth for later experiments.
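Conceptually, the attribution works as below. The honeypot domain, log format, and helper names are assumptions for illustration; only the idea of embedding the extension ID in visited URLs comes from our setup.

```python
import re

# Visited URLs embed the extension under test, e.g.
# http://<extension_id>.honeypot.example/<page>. Any request that later
# hits the public twin of that server reveals who leaked the URL.
BEACON = re.compile(r"\b([a-p]{32})\.honeypot\.example\b")

def attribute_leaks(access_log_lines):
    leaking_extensions = set()
    for line in access_log_lines:
        match = BEACON.search(line)
        if match:
            # Chrome extension IDs are 32 characters from the range a-p.
            leaking_extensions.add(match.group(1))
    return leaking_extensions
```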
Here we compare extension execution to incoming requests over time. We noticed that leaked history is often used immediately after it leaks, to crawl the leaked sites. These connections confirm that leaked browsing history is used by its receivers and is not leaked purely coincidentally. However, we identified no malicious behavior, such as vulnerability scans, in our log files.
Extension | Connecting host
ABC ad blocking China special edition [translated] | nat-service.aws.kontera.com
CTRL-ALT-DEL new tab | Chistodeti
Desprotetor de Links | Woopages
Pop up blocker for Chrome | 199.175.48.183
Similar Sites | static.36.51.9.176.clients.your-server.de
Beyond behavior over time, another aspect is possible collaboration between extension authors. In our honeypot probe we observed hosts that connected to multiple URLs unique to different extensions, and conversely URLs that received connections from multiple hosts. These relations are possible indicators of some form of data sharing or shared infrastructure between trackers. Each line in the table above consists of such a connected group.
System Description
Our system has three main components.
Unsupervised learning: Based on counterfactual analysis of network traffic over multiple executions, we detect history-stealing extensions.
Triage-based analysis: A scoring system that can highlight extensions which have suspicious traffic behavior. It can be used as a pre-processing step to manually vet extensions.
Supervised learning: Using a labeled dataset from previous experiments, we can systematize identification of suspicious extensions. We build a model that detects leaks based on API calls.
In this post we will focus on the unsupervised learning component, for the other components we refer to the paper.
Comparison of sent traffic over several execution stages with an increasing amount of history. On the left are history-leaking extensions, on the right benign ones. The data sent out by benign extensions varies little, while for trackers it varies with the amount of history supplied.
To identify privacy-violating extensions, we exercise them in multiple stages, changing the amount of private data supplied to the browser, and in turn to the extension under test. Depending on the type of extension, traffic can legitimately change with the number of visited sites. However, the underlying assumption is that benign extension traffic should not be influenced by the size of the browsing history.
Based on this insight, we use linear regression on each set of flows to estimate the parameters that best support identifying history-leaking extensions. We aim to establish a causal relation between two variables: (i) the amount of raw data sent over the network and (ii) the amount of history leaked to a given domain. For this, we rely on a counterfactual analysis model: the size of the history we provide to an extension is the input variable to a controlled environment, and the outgoing traffic is the output variable for our classification. We also use other indicators, such as the lower bound of the compressed history size, as cut-off values. The details of our detection engine are described in the full paper (see links at the top and bottom of this post).
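A minimal sketch of that regression; the numbers and the cut-off value below are illustrative, not our measured parameters:

```python
import numpy as np

def looks_like_leak(history_sizes, bytes_sent, min_slope=40.0):
    # Fit bytes_sent ~ slope * history_size + intercept per destination.
    slope, _intercept = np.polyfit(history_sizes, bytes_sent, 1)
    # Benign traffic should be flat in history size; a slope above the
    # lower bound of compressed history suggests the history itself travels.
    return slope >= min_slope

# Four execution stages with increasing history supplied to the browser.
history = np.array([0, 50, 100, 200])            # URLs in history
traffic = np.array([3000, 9500, 16000, 29000])   # bytes sent to one domain
print(looks_like_leak(history, traffic))          # True: ~130 bytes per URL
```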
Ex-Ray extension execution overview. After downloading extensions from the Chrome Web Store, we exercise them in containers to collect traces for classification. To support our honeypot experiment we only access Web and DNS locally. As the subdomains we use are unique per extension and we keep the connections local to a container, leaks can be linked to the extension under test.
Results
In total, Ex-Ray flagged 212 Chrome extensions as history-leaking, including two extensions that were undetectable by prior work. Web of Trust encrypts data at the extension level (using RC4) before transferring it via HTTPS. Coupon Mate leaks browsing history via WebSockets, which are used by 0.96% of the extensions we analyzed. Prior work relies on keyword analysis of particular protocols and would not have triggered on either extension.
Our dataset of flagged extensions and a triage report are available in our repository.
The number of extensions leaking history is troublesome, particularly since it is possible even for extensions with only modest permission access. While tracking on websites is prevalent, websites have to opt in to it, and solutions exist that allow users to block trackers (e.g., Ghostery). Conversely, tracking in browser extensions covers all visited websites, and no opt-out mechanism exists. Extension stores do not appear to monitor for this behavior.
Takeaways
Our key takeaways from this project are as follows:
It is easy for a browser extension to monitor and report browsing to a third party without requesting suspicious permissions.
Extensions utilize leaking channels that have not been considered by state-of-the-art leak detection before.
Leaking behavior can be detected in a robust way with a combination of supervised and unsupervised methods, for example with a system such as Ex-Ray.
Extension stores should monitor for such behavior and alert users of history leaks.
As a general precaution, users should be careful when installing browser extensions, as stores do not monitor for such behavior currently.
Conclusion
We introduced a new method for detecting privacy-violating browser extensions, independent of the protocol they use, and developed a prototype system: Ex-Ray. Our system uses a combination of supervised and unsupervised methods to identify features characteristic of leaking extensions. We analyzed all extensions from the Chrome Web Store with more than 1,000 installations (10,691 total) and flagged 212 as history-leaking. Two of the flagged extensions were leaking history in previously undetectable ways. We suggest that extensions should be tested more rigorously when admitted to the store, as well as monitored while they execute within browsers. Our paper is available for download here: (pdf and bib).
This post describes a system we developed recently to re-introduce humans into automated vulnerability discovery. While human experts can find bugs unreachable by automated bug finding, we were curious whether untrained humans can help automated systems do better. We found that by integrating human labor with no prior experience in bug finding, otherwise automated systems can overcome some of their shortcomings and find more bugs than they could on their own. We recruited 183 workers through Amazon Mechanical Turk who helped increase program coverage. In effect, this led to a 55% improvement in finding bugs in Cyber Grand Challenge (CGC) binaries. This blog post will discuss key insights and material that did not fit into our forthcoming CCS paper (pdf and bib), "Rise of the HaCRS". The paper was a collaboration between UC Santa Barbara, Arizona State University, and Northeastern University.
Update: additional materials (slides, video) available here.
Introduction
Mechanical Phish is an open source Cyber Reasoning System (CRS) that placed third in last year's CGC event. CGC was a fully automated hacking competition with no human interaction, the first computer-vs-computer hacking contest. While it pushed automated reasoning forward, it also highlighted shortcomings in the state of the art of automated bug finding. In this project we enhance fully automated bug finding by adding human assistance in areas where human intuition beats computing power.
A shortcoming of fully automated analyses is that tools start without real input and have to explore programs on their own. Even lacking intuition, these tools can fare well; for example, AFL can impressively reconstruct the JPG file format on its own. But we were curious whether better input seeds help automated reasoning, and found through experimentation that they enhance results significantly. In particular, human intuition can distinguish states that are logically different, e.g., winning a game as opposed to losing it. An automated system might be able to differentiate such states, but the implications are not clear to it. More generally: semantic hints given by programs go unnoticed by a CRS.
We developed a prototype system which we tested on Amazon Mechanical Turk, evaluating against the CGC sample binary corpus. The results back our suspicion that new inputs can improve CRS findings significantly.
Mechanical Turk
Amazon offers access to human assistants, where requesters can offer tasks to be solved for money. This service is often used to gather data where automation is infeasible or results must come from a human (e.g., surveys). While our system is not designed specifically for Mechanical Turk, we chose the platform for its vast access to workers. In HaCRS, a "Tasklet" is a request for human work on an issue the CRS cannot solve on its own. We issue these in steps, e.g., to improve coverage towards a specific target, and once that is done we aim higher.
We armed our system with Amazon credits and iteratively let it issue HITs, requesting labor to increase coverage so that Mechanical Phish could find more bugs. We generally had the system request coverage increases of 10%, and scaled the payout with difficulty: a tasklet we considered easy would earn $1, while a particularly hard one was worth $2.50. Performance was measured in triggered program transitions, and we provided live feedback as the Turkers exercised the programs (see screenshots below). We also issued bonus payments for performance beyond the requirement, to encourage Turkers to exercise the programs further. In total we paid $1,100 in base payments and bonuses to 183 Turkers.
HaCRS User Interface: The Human-Automation Link (HAL)
As we were hoping to enroll large numbers of unskilled workers in our experiments, the UI had to be self-explanatory to scale; issues with it would have resulted in confused emails and lost time on both ends. We tried to fit in all the information the Turkers could need, and to offer every option that could make them work faster.
Mechanical Turk does not allow Turkers to install software for tasks, for good reason: requesters could exploit this to make them install malware or other unwanted software. However, this presented a challenge for us, as our interface needed to be accessible to Turkers while observing this restriction. We decided to build a Web UI for our system, adding a noVNC JavaScript window that presents the interaction terminal. This choice also keeps us flexible for the future: we can reuse most of the UI while pointing noVNC at other targets.
Above we see the HaCRS Human-Automation Link (HAL). Turkers can type in the terminal to interact with the program. To the left is the progress window. We see how many transitions have been triggered and how many more need to be triggered to receive a payout.
Turkers see previous input/output sequences and can restore these states by clicking on a character in the interaction. All inputs are available to all Turkers; that is, if any Turker manages to reach a previously unknown program state, others can pick up from there and explore further without manually repeating all steps. A click spawns a new Docker container in the backend, replays the interaction, and makes it available to the Turker via noVNC. Note that such replay is only possible for systems where randomness is controlled; this is a general limitation and not specific to HaCRS.
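A sketch of the replay mechanism; the image name and the byte-level protocol are assumptions, but the flow matches the description above:

```python
import subprocess

def replay(image, recorded_inputs):
    # Spawn a fresh container for the challenge binary and replay the
    # recorded interaction; with randomness controlled, the program
    # deterministically reaches the state the Turker clicked on.
    proc = subprocess.Popen(["docker", "run", "-i", "--rm", image],
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    for chunk in recorded_inputs:
        proc.stdin.write(chunk)
        proc.stdin.flush()
    return proc  # the live session is then exposed to the Turker via noVNC
```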
We also offer programmatic input suggestions based on strings that might be encountered later in the program, which Mechanical Phish otherwise lacks the program context to use directly. These strings can serve as inspiration for humans to exercise the program better.
Sample program: NRFIN_00005
We will demonstrate HaCRS's capabilities on NRFIN_00005, a game described as "Tic-Tac-Toe, with a few modifications". The player does not see the game board and has to keep track of the state on their own. See the screenshots above for gameplay and sample inputs. The game has a null pointer dereference bug, which can be triggered by playing one round and then typing "START OVER"; other strings will not trigger the vulnerability.
Driller and AFL (the two main components of Mechanical Phish) were not able to play the game successfully, as they cannot reason about the state of the game. Our Turkers, however, won the game easily, but afterwards typed strings such as "PLAY AGAIN", which does not trigger the bug. Mechanical Phish then picks up the Turker input, recognizes "START OVER" as a special state, mutates the input towards it, and crashes the program.
Takeaways
Our key takeaways from this project are as follows:
Input seeds can impact CRS results significantly, and should be used in conjunction with symbolic execution and fuzzing.
Even unskilled users' intuition can improve CRS results.
Mechanical Turk turned out to be a good platform for collecting diverse program interactions.
Semi-experts did not fare significantly better than non-expert users. However, this could be a limitation of our system.
Future Work
For HaCRS, we used humans to increase program coverage to reach states that Mechanical Phish could turn into crashes. However, we envision involving humans in other areas to enhance CRSs, for example enrolling them more directly in exploit generation, or in testing patches to verify fixes. These tasks might be less suitable for unskilled labor and will require more research. Furthermore, finding optimal incentive structures could increase the performance of such systems.
Conclusion
We had a total of 183 Turkers work for us at a combined cost of $1,100. These Turkers helped Mechanical Phish find 55% more bugs than it could on its own. HaCRS presents a step towards augmenting traditional CRSs with human intuition where computers are still lacking. Such a combined approach should be explored further to overcome CRS obstacles. Our paper features case studies and implementation details about the system. The full paper is available here: pdf and bib, and will be presented at CCS in Dallas.