The basket case: first evaluation - The HOT research project

This post presents an evaluation of a next generation pluggable transport from the Tor project: basket2. The evaluation focuses on website fingerprinting attacks, which pluggable transports are not primarily designed to mitigate. There is nonetheless a clear connection between censorship circumvention (the primary purpose of pluggable transports) and website fingerprinting defenses: both depend on resistance to traffic analysis. Please note that this is ongoing work and as such is subject to updates in the coming months.

Gathering data

We evaluate the six padding modes of basket2:

Null: no padding (only for testing).
Obfs4Burst: minimal padding for bursts.
Obfs4BurstIAT: adds a random delay after a burst.
Obfs4PacketIAT: breaks traffic into small, randomly sized packets and adds random delay.
Tamaraw: the website fingerprinting defense by Cai et al. with tweaks by Yawning and parameters from Wang’s PhD thesis.
TamarawBulk: like Tamaraw, but tweaked for bulk transfers.

For each mode, we gathered an open world dataset of 100 monitored websites with 100 instances each and 10,000 unmonitored sites. To create our datasets we used Docker and Tor Browser 6.0.5. Our tools are available at github.com/pylls/thebasketcase, including:

a Docker image for running basket2,
a Docker image for running Tor Browser and dumping all network traffic, and
server-client tools for orchestrating data gathering.

Please see the respective READMEs in the repository for instructions (still messy). We’ll cover these steps in more detail in a later post, once the instructions are updated and we have refined our data collection approach.

Our datasets are gathered as raw pcaps starting from the launch of Tor Browser. We did not prune our datasets further, neglecting issues like CloudFlare CAPTCHAs, outliers, control cells, and localized domains. Further, there appears to be a bug causing bridge users to ignore cached consensus files. With more care in this area, expect significantly better results than what you find below. Since our focus is on the difference between the basket2 modes and between some attacks, the dirty datasets will suffice for now. You can find the datasets in the form of extracted features in the Wa-kNN format here.

Evaluation

We evaluate each basket2 mode against three attacks:

Wa-kNN with k=1 for the highest possible recall. We covered Wa-kNN in a prior post.
DefecTor (ctw) - uses observed DNS traffic to close the world on the Wa-kNN attack.
DefecTor (hp) - confirms the classification of Wa-kNN with DNS traffic.

Read more about DefecTor attacks at [1, 2]. Note that comparing Wa-kNN to DefecTor attacks is in a way like comparing apples and oranges: the DefecTor attacks assume that the attacker can observe a significant portion of DNS traffic generated by Tor exits in addition to being in a position to do a website fingerprinting attack. Below, we assume that the attacker can observe all DNS traffic exiting the Tor network. The Tor project is up-front about its limitations and we already know that such a powerful attacker is outside of Tor’s threat model. We used the DefecTor tools at https://github.com/pylls/defector.
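To make the difference between the two DefecTor variants concrete, here is a minimal sketch of the decision logic as described above. It is not the code from the defector repository: the function names, the use of `None` for "unmonitored", and the idea of feeding in a ranked list of Wa-kNN candidates are simplifying assumptions made for illustration.

```python
from typing import Optional, Sequence, Set


def hp_attack(wf_guess: Optional[str], dns_seen: Set[str]) -> Optional[str]:
    """Sketch of the (hp) idea: keep the fingerprinting guess only if the
    guessed site also shows up in the observed DNS traffic.

    wf_guess is the site picked by the fingerprinting attack, or None if the
    trace was classified as unmonitored (an assumed convention).
    """
    if wf_guess is not None and wf_guess in dns_seen:
        return wf_guess
    return None  # unconfirmed guesses are treated as unmonitored


def ctw_attack(ranked_guesses: Sequence[str], dns_seen: Set[str]) -> Optional[str]:
    """Sketch of the (ctw) idea: restrict the candidate sites to those seen
    in DNS traffic, then take the best-ranked remaining candidate.

    ranked_guesses is assumed to be ordered from most to least likely.
    """
    for site in ranked_guesses:
        if site in dns_seen:
            return site
    return None  # no candidate was observed in DNS


if __name__ == "__main__":
    dns = {"example.com", "torproject.org"}
    print(hp_attack("example.com", dns))
    print(ctw_attack(["other.net", "torproject.org", "example.com"], dns))
```

The sketch shows why the two variants behave differently: (hp) can only discard guesses, so it trades recall for precision, while (ctw) can replace a wrong first guess with a lower-ranked one that DNS supports, which can raise recall.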

Ok, that said, now let’s continue with the evaluation. We perform 10-fold cross-validation, and calculate the precision and recall for each combination of attack and basket2 mode. We start with looking at recall:

			Recall
Attack/Basket2	Null	Obfs4Burst	Obfs4BurstIAT	Obfs4PacketIAT	Tamaraw	TamarawBulk
Wa-kNN (k=1)	0.193	0.284	0.184	0.196	0.012	0.013
DefecTor (ctw)	0.256	0.358	0.399	0.233	0.057	0.409
DefecTor (hp)	0.184	0.270	0.175	0.187	0.012	0.012

We see that for the Wa-kNN (k=1) attack, the Tamaraw padding modes—designed to mitigate website fingerprinting attacks—are highly effective. On the other hand, the Obfs4Burst mode significantly increases (+47%) recall over Null padding. For the DefecTor (hp) attack, the same analysis as for Wa-kNN applies. DefecTor (ctw) on the other hand sees a significant increase over Null padding for Obfs4Burst (+39%), Obfs4BurstIAT (+55%), and TamarawBulk (+59%). Note that DefecTor (ctw) has close to five times as high recall as Wa-kNN against Tamaraw.
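The relative changes quoted here can be recomputed from the recall table with a few lines of Python. The helper name `relative_change` is ours, and because the table entries are rounded to three decimals, the recomputed figures can differ from the quoted ones by up to a percentage point.

```python
def relative_change(value: float, baseline: float) -> float:
    """Return the change of value relative to baseline, e.g. 0.47 for +47%."""
    return (value - baseline) / baseline


# Recall values copied from the table above.
recall = {
    "Wa-kNN (k=1)": {"Null": 0.193, "Obfs4Burst": 0.284, "Tamaraw": 0.012},
    "DefecTor (ctw)": {
        "Null": 0.256,
        "Obfs4Burst": 0.358,
        "Obfs4BurstIAT": 0.399,
        "Tamaraw": 0.057,
        "TamarawBulk": 0.409,
    },
}

if __name__ == "__main__":
    for attack, modes in recall.items():
        for mode, value in modes.items():
            if mode == "Null":
                continue
            change = relative_change(value, modes["Null"])
            print(f"{attack} / {mode}: {change:+.1%} vs. Null")

    # Ratio behind the "close to five times" remark for Tamaraw.
    ratio = recall["DefecTor (ctw)"]["Tamaraw"] / recall["Wa-kNN (k=1)"]["Tamaraw"]
    print(f"ctw vs. Wa-kNN recall against Tamaraw: {ratio:.2f}x")
```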

Next up is precision:

			Precision
Attack/Basket2	Null	Obfs4Burst	Obfs4BurstIAT	Obfs4PacketIAT	Tamaraw	TamarawBulk
Wa-kNN (k=1)	0.179	0.247	0.183	0.187	0.014	0.013
DefecTor (ctw)	0.867	0.940	0.848	0.867	0.560	0.608
DefecTor (hp)	0.912	0.956	0.942	0.905	0.529	0.647

Once again, for Wa-kNN, we see that Obfs4Burst gives the attacker an advantage over Null padding and that both modes of Tamaraw are highly effective defenses. For DefecTor, both attacks are highly precise against all non-Tamaraw modes, and both reach their highest precision against Obfs4Burst. Against the Tamaraw modes, DefecTor classifications are still (barely) more likely to be correct than wrong, with a precision over 50%. That TamarawBulk is a weaker defense than Tamaraw in this case is not surprising, since our data consists of website visits rather than bulk transfers.
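For readers who want to reproduce this kind of evaluation, the following is a minimal sketch of 10-fold splitting and of precision and recall in an open world setting. It is not the code used for the tables. We assume here that `None` marks an unmonitored site and that a monitored trace attributed to the wrong monitored site counts against both precision and recall; other papers define these metrics slightly differently.

```python
from typing import List, Optional, Sequence, Tuple


def k_fold_indices(n: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n-1 into k (train, test) pairs.

    Fold i tests on indices i, i+k, i+2k, ... and trains on the rest.
    """
    folds = []
    for i in range(k):
        test = list(range(i, n, k))
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        folds.append((train, test))
    return folds


def precision_recall(
    truth: Sequence[Optional[str]], predicted: Sequence[Optional[str]]
) -> Tuple[float, float]:
    """Open world precision and recall; None means "unmonitored".

    A true positive is a monitored trace attributed to the correct site.
    """
    tp = sum(1 for t, p in zip(truth, predicted) if t is not None and t == p)
    predicted_monitored = sum(1 for p in predicted if p is not None)
    actual_monitored = sum(1 for t in truth if t is not None)
    precision = tp / predicted_monitored if predicted_monitored else 0.0
    recall = tp / actual_monitored if actual_monitored else 0.0
    return precision, recall


if __name__ == "__main__":
    truth = ["a", "a", "b", "b", None, None]
    predicted = ["a", "a", None, None, "a", None]
    print(precision_recall(truth, predicted))
```

In a full run, each fold's test traces would be classified using only that fold's training traces, and the per-fold counts would be summed or averaged before reporting.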

Summary

Tamaraw is an effective website fingerprinting defense. Surprisingly, the Obfs4Burst mode improves the effectiveness of Wa-kNN. This might be an artefact of our messy datasets and warrants further investigation. DefecTor attacks are highly precise, retaining over 50% precision even in the face of Tamaraw. We know that our dataset is messy, and next we’ll improve our data gathering approach to better fit basket2 and re-run the evaluation.