Understanding the website fingerprinting setting

Website fingerprinting attacks are attacks in which a local, passive attacker attempts to identify the web pages a client accesses, even though the client is using technologies like Tor, SSH tunneling, proxies, or VPNs. Our project focuses on Tor, so we start with a basic understanding of Tor.

Tor

To enable you to browse the Internet anonymously, Tor sends your Internet traffic through three relays: a guard relay, a middle relay, and an exit relay. A relay is a computer, run by a volunteer, that is part of the Tor network. At the time of writing there are approximately 7000 relays in the network. Your Internet traffic enters the Tor network through the guard relay and exits to its destination, such as a website, through the exit relay.

Tor setting

The guard relay knows the client’s network identity (IP address); it guards that identity from the rest of the Tor network. The exit relay knows the client’s destination (the website), since this is where the traffic exits the network. The middle relay knows only the identities of the guard and exit relays, sitting in the middle between them. Assuming the relays are not colluding, no single relay knows that a particular client is accessing a particular website. You can learn more about Tor here.
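The split of knowledge between the three relays can be sketched in a few lines of Python. This is only an illustrative model, not how Tor is implemented: the `Circuit` class, the `view` and `links_client_to_destination` helpers, and all addresses and relay names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Circuit:
    """A hypothetical three-hop path from a client to a destination."""
    client_ip: str
    guard_relay: str
    middle_relay: str
    exit_relay: str
    destination: str


def view(circuit, role):
    """Return the two neighbours a relay in the given role can see."""
    if role == "guard":
        return {"previous_hop": circuit.client_ip,
                "next_hop": circuit.middle_relay}
    if role == "middle":
        return {"previous_hop": circuit.guard_relay,
                "next_hop": circuit.exit_relay}
    if role == "exit":
        return {"previous_hop": circuit.middle_relay,
                "next_hop": circuit.destination}
    raise ValueError(f"unknown role: {role}")


def links_client_to_destination(circuit, colluding_roles):
    """True if the pooled views reveal both the client and the destination."""
    known = set()
    for role in colluding_roles:
        known.update(view(circuit, role).values())
    return circuit.client_ip in known and circuit.destination in known


# Made-up example values (documentation IP range, invented relay names).
example = Circuit("203.0.113.7", "guard-1", "middle-1", "exit-1", "example.org")
for roles in (["guard"], ["middle"], ["exit"], ["guard", "exit"]):
    print(roles, links_client_to_destination(example, roles))
```

In this toy model no single relay can link the client to the destination; only a colluding guard and exit can.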

The attacker

Our attacker is both local and passive. By local we mean that the attacker is relatively close to the client and therefore knows its network identity (IP address). For example, the attacker could be observing all traffic through the client’s wireless home router, the client’s ISP, or even all traffic going through the client’s guard relay in the Tor network. By passive we mean that the attacker only observes network traffic and does not modify, delay, drop, or inject traffic belonging to the client.

Local attacker

Such an attacker is in principle an invisible threat to clients and within the scope of Tor’s threat model. The attacker has its work cut out for it, though: all traffic is encrypted. Furthermore, we assume that the underlying crypto is perfect, meaning that the attacker cannot break it to learn the plaintext (page) contained within. As it turns out, though, the encrypted traffic alone is more than enough for the attacker, because patterns such as the volume, timing, and direction of the traffic remain observable.

The attack

In a website fingerprinting attack, the local and passive attacker observes encrypted traffic from an identified client. The attacker has a list of monitored pages, each belonging to some website. Each time the client visits a page, the attacker’s goal is to identify which page on its monitored list the client is visiting (if any). In laboratory experiments, researchers often consider only a closed world, where the client visits only monitored pages. In a more realistic setting, called the open world, the client can also visit non-monitored pages.
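To make the closed-world and open-world distinction concrete, here is a minimal Python sketch. The page names, the `MONITORED` set, and the helper functions are all made up for illustration.

```python
# Hypothetical list of pages the attacker monitors.
MONITORED = frozenset({"example.org/login", "example.net/news"})
NON_MONITORED = "non-monitored"


def true_label(page, monitored=MONITORED):
    """Ground-truth class of a visit: the page itself, or 'non-monitored'."""
    return page if page in monitored else NON_MONITORED


def is_closed_world(visits, monitored=MONITORED):
    """A closed world means every visited page is on the monitored list."""
    return all(page in monitored for page in visits)


lab_visits = ["example.org/login", "example.net/news", "example.org/login"]
real_visits = ["example.org/login", "example.com/cats"]

print(is_closed_world(lab_visits))
print(is_closed_world(real_visits))
print([true_label(page) for page in real_visits])
```

In the open world the attacker's classifier needs one extra class for everything that is not on its list.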

You are being monitored

The attack is divided into two phases: training and testing. In the training phase, the attacker visits each page on its monitored list and repeats this procedure a fixed number of times (the number of instances of each page). In the open world, the attacker also visits a fixed number of non-monitored pages. The purpose of visiting pages is to train a machine learning classifier that assigns traffic to classes (pages) that are either monitored or non-monitored. In the testing phase, the attacker gets only traces of the client’s encrypted traffic and attempts to classify which pages the client is visiting.
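The two phases can be sketched with a toy nearest-neighbour classifier in Python. This is not one of the published attacks. It assumes a trace is a list of cell directions (+1 outgoing, -1 incoming), uses three hand-picked features, and the traces and labels are invented. Real attacks use far richer features and many more instances.

```python
import math


def features(trace):
    """Summarise a trace as (outgoing cells, incoming cells, direction switches)."""
    outgoing = sum(1 for d in trace if d > 0)
    incoming = sum(1 for d in trace if d < 0)
    switches = sum(1 for a, b in zip(trace, trace[1:]) if a != b)
    return (outgoing, incoming, switches)


def train(labelled_traces):
    """Training phase: store the features of every trace the attacker collected."""
    return [(features(trace), label) for trace, label in labelled_traces]


def classify(model, trace):
    """Testing phase: return the label of the closest training trace."""
    target = features(trace)
    nearest = min(model, key=lambda item: math.dist(item[0], target))
    return nearest[1]


# Invented training data: two instances per monitored page, one non-monitored.
TRAINING = [
    ([+1] * 2 + [-1] * 10, "page_a"),
    ([+1] * 3 + [-1] * 11, "page_a"),
    ([+1, -1] * 6, "page_b"),
    ([+1, -1] * 7, "page_b"),
    ([+1] * 8 + [-1] * 2, "non-monitored"),
]

model = train(TRAINING)
print(classify(model, [+1] * 2 + [-1] * 11))  # page_a
print(classify(model, [+1, -1] * 5))          # page_b
print(classify(model, [+1] * 9 + [-1] * 2))   # non-monitored
```

The attacker never sees page content here, only the shape of the traffic, which is the point of the attack.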

Surprisingly accurate

So how good are website fingerprinting attacks today? Surprisingly good, it turns out, at least in the closed world. Below you find a results table from Tao Wang’s PhD thesis comparing the results of all attacks published in the literature. The format of each experiment (columns) is given as monitored pages x instances. TPR means true positive rate and in essence captures how often the attack correctly classifies a monitored page in the testing data as being monitored. We’ll look closer at the statistics involved in website fingerprinting in a later post.

Results table from Tao Wang’s PhD thesis

For example, we see from the table that the Wa-kNN attack correctly classifies a visited monitored page in the testing data close to 95% of the time. Scary, huh? To make matters worse, modern attacks like Wa-kNN need only seconds of CPU time, unlike older attacks like Wa-OSAD, which needed 4937 CPU hours for the attack in the table (Wa-kNN needed 156 CPU seconds). This means that the attacker can easily perform website fingerprinting with little more than a regular desktop computer.
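As a rough illustration of what a TPR figure measures, the Python sketch below computes it from a list of true labels and a list of predicted labels. It follows the informal definition used in this post (a monitored page counts as a hit if it is classified as monitored). A stricter variant would require the exact page to match. The function name, labels, and data are made up.

```python
def true_positive_rate(true_labels, predicted_labels,
                       non_monitored="non-monitored"):
    """Fraction of monitored test visits that were classified as monitored."""
    monitored_pairs = [
        (truth, guess)
        for truth, guess in zip(true_labels, predicted_labels)
        if truth != non_monitored
    ]
    if not monitored_pairs:
        raise ValueError("no monitored pages in the test data")
    hits = sum(1 for _, guess in monitored_pairs if guess != non_monitored)
    return hits / len(monitored_pairs)


# Invented test results: four monitored visits and one non-monitored visit.
truth = ["page_a", "page_a", "page_b", "page_b", "non-monitored"]
guess = ["page_a", "page_b", "non-monitored", "page_b", "page_a"]

print(true_positive_rate(truth, guess))  # 0.75
```

Three of the four monitored visits are flagged as monitored, so the TPR is 0.75. The misclassified non-monitored visit does not affect TPR; it would count towards the false positive rate instead.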

Moving forward

As we have seen, state-of-the-art website fingerprinting attacks are very accurate in the closed world setting and a real threat to Tor users. Both the Tor and research communities spend time investigating practical defenses and more accurate attacks. There are several open questions and disagreements, though, such as:

How effective and realistic are website fingerprinting attacks in the real world? The real world includes both being in the open world and dealing with complexities like users’ browsing behavior and having to remove noise, such as file downloads, from the Tor traffic.
We know of several effective defenses to protect against website fingerprinting attacks (more about that in a later post), but how can we make them more efficient and practical?

We hope to contribute to answering these questions. For those who want to learn more about website fingerprinting attacks, Tao Wang’s PhD thesis is a great read.

Sources

The octopus representing the attacker is extracted from the work by Goran tek-en (CC BY-SA 3.0), which depicts the mission patch for NROL-39.
The onion representing Tor relays is from the Tor Project.
The table is from Tao Wang’s PhD thesis.