The basket case: second evaluation - The HOT research project

This is our second evaluation of basket2, the next-generation pluggable transport from the Tor Project. You can find the first evaluation here. Please note that this evaluation focuses on website fingerprinting attacks, which pluggable transports are not primarily designed to mitigate, and that this is ongoing work.

Gathering data

We evaluate the six padding methods of basket2:

Null: no padding (only for testing).
Obfs4Burst: minimal padding for bursts.
Obfs4BurstIAT: adds a random delay after a burst.
Obfs4PacketIAT: breaks traffic into small, randomly sized packets and adds a random delay.
Tamaraw: the website fingerprinting defense by Cai et al., with tweaks by Yawning and parameters from Wang’s PhD thesis.
TamarawBulk: like Tamaraw, but tweaked for bulk transfers.
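
As a rough illustration of the idea behind Obfs4PacketIAT, the sketch below splits a payload into randomly sized packets and assigns each one a random delay. This is not basket2's implementation: the function name, the uniform distributions, and the `max_packet` and `max_delay_ms` defaults are assumptions chosen only for illustration.

```python
import random
from typing import List, Optional, Tuple


def split_with_delays(
    payload_len: int,
    max_packet: int = 512,       # hypothetical upper bound on packet size (bytes)
    max_delay_ms: float = 10.0,  # hypothetical upper bound on per-packet delay
    rng: Optional[random.Random] = None,
) -> List[Tuple[int, float]]:
    """Return a send schedule of (packet_size, delay_ms) pairs.

    Sizes and delays are drawn uniformly at random here; the real
    transport may use different distributions.
    """
    rng = rng or random.Random()
    remaining = payload_len
    schedule: List[Tuple[int, float]] = []
    while remaining > 0:
        # Never send more than what is left of the payload.
        size = rng.randint(1, min(max_packet, remaining))
        delay = rng.uniform(0.0, max_delay_ms)
        schedule.append((size, delay))
        remaining -= size
    return schedule


if __name__ == "__main__":
    for size, delay in split_with_delays(2000, rng=random.Random(1)):
        print(f"send {size:4d} bytes, then wait {delay:5.2f} ms")
```

The schedule always covers the payload exactly, while the packet sizes and timing seen on the wire vary from run to run.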

For each mode, we gathered an open-world dataset of 100 monitored websites with 100 instances each and 10,000 unmonitored sites. To create our datasets, we used Docker and Tor Browser 6.0.5. Our tools are available at github.com/pylls/thebasketcase. They will be the focus of later updates.

Compared to our prior evaluation, we significantly improved the collection method, resulting in less noise. You can find the datasets in the form of extracted features in the Wa-kNN format here.

Evaluation

We evaluate each basket2 method using Wa-kNN with different values of k, perform 10-fold cross-validation, and calculate the precision and recall for each combination of attack and basket2 method. The figure below shows the precision of the different methods for varying values of k:

Figure: Wa-kNN precision
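
For readers unfamiliar with the metrics, the sketch below shows one common way to compute precision and recall in an open-world setting. The exact convention used in this evaluation is not spelled out here, so treat the counting rules as an assumption. In particular, this sketch counts a monitored trace that is classified as the wrong monitored site as both a false positive and a false negative. The `UNMONITORED` label and the function name are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# Label used in this sketch for "not one of the monitored sites".
UNMONITORED = None


def precision_recall(
    truth: Sequence[Optional[int]],
    predicted: Sequence[Optional[int]],
) -> Tuple[float, float]:
    """Compute open-world precision and recall from per-trace labels."""
    tp = fp = fn = 0
    for t, p in zip(truth, predicted):
        if p is not UNMONITORED:
            if p == t:
                tp += 1  # monitored trace attributed to the right site
            else:
                fp += 1  # wrong site, or unmonitored trace flagged as monitored
        if t is not UNMONITORED and p != t:
            fn += 1      # monitored trace that the attack missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    truth = [0, 0, 1, UNMONITORED, UNMONITORED]
    predicted = [0, 1, 1, 0, UNMONITORED]
    print(precision_recall(truth, predicted))
```

Under this convention, high precision means the attacker's positive guesses are rarely wrong, and high recall means few visits to monitored sites go undetected. A defense is therefore better the lower both values are.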

First, we see that the methods designed to mitigate website fingerprinting attacks, Tamaraw and TamarawBulk, offer the best defense. The differences among the other methods are much smaller, except for Obfs4PacketIAT, which consistently yields the highest precision for the attack. The next figure shows recall:

Figure: Wa-kNN recall

Tamaraw and TamarawBulk once again show excellent protection. There is little difference between Null, Obfs4Burst, and Obfs4BurstIAT, while Obfs4PacketIAT shows significantly higher recall for the attack. With Obfs4BurstIAT and Obfs4PacketIAT highest, it appears likely that the delay introduced between bursts and packets is the key culprit. Possibly, the delay and the relatively limited padding (compared to the Tamaraw methods) emphasize features related to bursts. This remains to be investigated.