For our work on the effect of DNS on Tor’s anonymity we collected a large website fingerprinting dataset with 100 samples each of Alexa top 9,000 (monitored sites) and one sample each of Alexa top 909,000 (unmonitored). The data was collected with Tor Browser 5.5.4 using tools from the DefecTor toolset.
Download the data
We make the following files available:
- Raw logfiles: alexa9kx100+900k.tar.gz (15 GiB)
  SHA-256: c137074752143f893dba8857b0be1544ba12a6c08d4b296e7f63089e365fcf19
- Extracted cells, features and DNS requests: alexa9kx100+900k-dns+cells+feat.tar.gz (4.1 GiB)
  SHA-256: 2719475968afda4f36694fe9f84f9c1b1915db9ca440cf05b9a8361be55b8b05
- Extracted features: alexa9kx100+900k-feat.tar.gz (817 MiB)
  SHA-256: 4cfb258d4d1b12698cfa4aa56114692c646ee59dc7dbb3eecdde988336c16970
- Extracted features used in our paper: alexa1kx100+100k-feat.tar.gz (94 MiB)
  SHA-256: b7be02065cf20537683697cd083b26c2f299bb4ae5e089a58a2ba823132e8358
- Alexa top 1,000,000 file: top-1m.csv (22 MiB)
  SHA-256: 65f8d31a61164825900d50296de35bfbeaac405c9227abf5680ff61c404aa933
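Since the archives are large, it may be worth checking a download against its published SHA-256 before extracting it. The following is a minimal sketch in Python using only the standard library; the function names sha256_of_file and verify are our own and not part of the DefecTor toolset.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks
    so that multi-GiB archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's digest matches the published checksum."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, verify("alexa1kx100+100k-feat.tar.gz", "b7be02065cf20537683697cd083b26c2f299bb4ae5e089a58a2ba823132e8358") should return True for an intact download of the smallest feature archive.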
How the data was collected
Our collection method ensures a fresh circuit and Tor Browser for each sample while caching the consensus to reduce the load on the network. Please note that the data was collected with Tor Browser 5.5.4. Newer versions might require further modifications, e.g., to prevent unwanted network traffic or even to run Tor Browser in a container.
First, install the relevant tools using Go:
Download an Alexa file with top sites and run:
This creates two files for our monitored and unmonitored data. For collecting the monitored data, run:
For collecting the unmonitored data, run:
The server will instruct workers to collect -s samples of the sites in the specified files such as top9k.csv, using 60 seconds per site visit, and store the results in the data folder with the suffix .torlog.
By default, the server listens on port 55555 on all interfaces.
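To illustrate the two input files mentioned above, here is a rough Python sketch of how an Alexa list could be split into a monitored and an unmonitored part. It is a stand-in written for illustration, not the DefecTor tool itself. It assumes the Alexa file holds one rank,domain row per line sorted by rank, and that the unmonitored set is the 900,000 sites that follow the 9,000 monitored ones. Both the output file format expected by the server and any file name other than top9k.csv are assumptions.

```python
import csv


def split_alexa(rows, monitored=9000, unmonitored=900000):
    """Split rank-sorted (rank, domain) rows into two lists.

    Assumption: the unmonitored sites are the ones ranked directly
    after the monitored sites, so the two sets never overlap.
    """
    rows = list(rows)
    return rows[:monitored], rows[monitored:monitored + unmonitored]


def split_alexa_file(src, monitored_path, unmonitored_path,
                     monitored=9000, unmonitored=900000):
    """Read an Alexa CSV and write the two parts to separate files."""
    with open(src, newline="") as handle:
        mon, unmon = split_alexa(csv.reader(handle), monitored, unmonitored)
    for path, part in ((monitored_path, mon), (unmonitored_path, unmon)):
        with open(path, "w", newline="") as out:
            csv.writer(out, lineterminator="\n").writerows(part)
```

A call such as split_alexa_file("top-1m.csv", "top9k.csv", "unmonitored.csv") would then produce the two lists, where unmonitored.csv is a hypothetical name.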
Download a fresh copy of Tor Browser and extract it. Open Browser/TorBrowser/Data/Browser/profile.default/preferences/ and put the following at the bottom of extension-overrides.js:
Open Browser/TorBrowser/Data/Tor/torrc and add:
Launch Tor Browser and follow this guide from Mozilla. Next, we need to build a custom tor binary for Tor Browser that logs all incoming and outgoing cells using Tor’s logging framework. First, get the tor source code:
Follow the instructions in the INSTALL file to build tor.
Once you can build tor, open src/or/relay.c, find relay_send_command_from_edge_ around line 580 and add:
Next, still in src/or/relay.c, find connection_edge_process_relay_cell around line 1433 and add:
Finally, if you want to log resolved DNS, open src/or/addressmap.c, find client_dns_set_addressmap_impl around line 665 and add:
Run make, then copy src/or/tor to Browser/TorBrowser/tor.
Download the latest release of dumb-init. Copy the following into a new file named Dockerfile (based on work by Jess Frazelle):
Build the docker container and start a worker:
Finally, to extract the data from the resulting torlog-files use the torlogext and fext tools:
The torlogext tool generates cell-files in the format used by Wang et al., and the fext tool extracts features for Wa-kNN.
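To give a feel for working with the extracted cell-files, the sketch below parses one trace and counts cells per direction. It assumes the layout commonly seen in the datasets published by Wang et al.: one cell per line, holding a timestamp and a direction separated by whitespace, with a positive direction for outgoing cells and a negative one for incoming cells. If the files produced by torlogext differ from this, the parsing needs to be adjusted. The function names are our own.

```python
def parse_cells(lines):
    """Parse 'timestamp direction' lines into (float, int) tuples.

    Assumption: whitespace-separated fields, direction is +1 or -1.
    Blank lines are skipped.
    """
    cells = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        timestamp, direction = fields[0], fields[1]
        cells.append((float(timestamp), int(direction)))
    return cells


def summarize(cells):
    """Count cells per direction and compute the trace duration."""
    outgoing = sum(1 for _, direction in cells if direction > 0)
    incoming = sum(1 for _, direction in cells if direction < 0)
    duration = cells[-1][0] - cells[0][0] if cells else 0.0
    return {"outgoing": outgoing, "incoming": incoming, "duration": duration}
```

A trace can be loaded with open(path) and passed directly to parse_cells, since a file object iterates over its lines.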
Wrapping up
When collecting our data we used between 200 and 400 Docker containers (depending on hardware availability) for several days. The server is not resource constrained and we ran it on a commodity laptop. The clients were run using CoreOS on 1U blades with an Intel(R) Xeon(R) CPU E5-2650 @ 2.00GHz and 62 GiB RAM.
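As a back-of-the-envelope check on the "several days" figure, the sketch below estimates the collection time from the numbers given on this page. It assumes 900,000 unmonitored visits (as the archive name suggests), that every visit uses its full 60 seconds, and that containers are always busy with no retries or startup overhead, so it is only a rough estimate.

```python
SECONDS_PER_DAY = 86400


def collection_days(visits, seconds_per_visit, containers):
    """Estimate wall-clock days when visits are spread evenly
    over containers that each handle one visit at a time."""
    return visits * seconds_per_visit / containers / SECONDS_PER_DAY


# 9,000 monitored sites x 100 samples, plus an assumed 900,000
# unmonitored sites x 1 sample.
visits = 9000 * 100 + 900000

for containers in (200, 400):
    days = collection_days(visits, 60, containers)
    print(containers, "containers:", round(days, 2), "days")
```

Under these assumptions the estimate lies between roughly three and six days, which is in line with the duration reported above.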