For our work on the effect of DNS on Tor’s anonymity, we collected a sizable website fingerprinting dataset with 100 samples each of the Alexa top 9,000 sites (monitored) and one sample each of the Alexa top 909,000 sites (unmonitored). The data was collected with Tor Browser 5.5.4 using tools from the DefecTor toolset.

Download the data

We make the following files available:

  • Raw logfiles: alexa9kx100+900k.tar.gz (15 GiB)
    SHA-256: c137074752143f893dba8857b0be1544ba12a6c08d4b296e7f63089e365fcf19
  • Extracted cells, features and DNS requests: alexa9kx100+900k-dns+cells+feat.tar.gz (4.1 GiB)
    SHA-256: 2719475968afda4f36694fe9f84f9c1b1915db9ca440cf05b9a8361be55b8b05
  • Extracted features: alexa9kx100+900k-feat.tar.gz (817 MiB)
    SHA-256: 4cfb258d4d1b12698cfa4aa56114692c646ee59dc7dbb3eecdde988336c16970
  • Extracted features used in our paper: alexa1kx100+100k-feat.tar.gz (94 MiB)
    SHA-256: b7be02065cf20537683697cd083b26c2f299bb4ae5e089a58a2ba823132e8358
  • Alexa top 1,000,000 file: top-1m.csv (22 MiB)
    SHA-256: 65f8d31a61164825900d50296de35bfbeaac405c9227abf5680ff61c404aa933
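
After downloading, each archive can be checked against its listed SHA-256 value. The following is a minimal sketch in Python; the helper names sha256_of_file and verify are our own illustration and are not part of the DefecTor toolset. The file is read in chunks so that the 15 GiB archive does not need to fit in memory.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's digest matches the published checksum."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, verify("top-1m.csv", "65f8d31a61164825900d50296de35bfbeaac405c9227abf5680ff61c404aa933") should return True for an intact download of the Alexa file, assuming it was saved in the current directory.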

How the data was collected

Our collection method ensures a fresh circuit and a fresh Tor Browser for each sample, while caching the consensus to reduce the load on the network. Please note that the data was collected with Tor Browser 5.5.4; newer versions might require further modifications, e.g., to prevent unwanted network traffic or even to run Tor Browser in a container.

First, install the relevant tools using Go:

go get github.com/pylls/defector/cmd/{server,tbw}
go get github.com/pylls/defector/cmd/{torlogext,fext}

Download an Alexa file with top sites and run:

head -n 9000 top-1m.csv > top9k.csv
head -n 909000 top-1m.csv > top909k.csv

This creates two files, one listing the sites for our monitored data and one for our unmonitored data. For collecting the monitored data, run:

server -f data -s 100 -t 60 -o .torlog top9k.csv

For collecting the unmonitored data, run:

server -f data -s 1 -t 60 -o .torlog top909k.csv

The server instructs workers to collect -s samples of each site listed in the specified file (e.g., top9k.csv), using 60 seconds per site visit (-t), and stores the results in the data folder (-f) with the suffix .torlog (-o). By default, the server listens on port 55555 on all interfaces.

Download a fresh copy of Tor Browser and extract it. Open Browser/TorBrowser/Data/Browser/profile.default/preferences/ and put the following at the bottom of extension-overrides.js:

user_pref("app.update.enabled", false);
user_pref("extensions.torlauncher.prompt_at_startup", false);
user_pref("datareporting.healthreport.nextDataSubmissionTime", "1759373924100");
user_pref("datareporting.policy.firstRunTime", "1759287524100");
user_pref("extensions.torbutton.lastUpdateCheck", "1759287542.7");
user_pref("extensions.torbutton.show_slider_notification", false);
user_pref("extensions.torbutton.updateNeeded", false);
user_pref("extensions.torbutton.versioncheck_url", "");
user_pref("extensions.torbutton.versioncheck_enabled", false);

Open Browser/TorBrowser/Data/Tor/torrc and add:

LogTimeGranularity 1
UseEntryGuards 0

Launch Tor Browser and follow this guide from Mozilla. Next, we need to build a custom tor binary for Tor Browser that logs all incoming and outgoing cells using Tor’s logging framework. First, get the tor source code:

git clone https://git.torproject.org/tor.git

Follow the instructions in the INSTALL file to build tor. Once you can build tor, open src/or/relay.c, find relay_send_command_from_edge_ around line 580 and add:

log_notice(LD_GENERAL, "OUTGOING CIRC %u STREAM %d COMMAND %s(%d) length %zu",
  circ->n_circ_id, stream_id, relay_command_to_string(relay_command), relay_command, payload_len);

Next, still in src/or/relay.c, find connection_edge_process_relay_cell around line 1433 and add:

log_notice(LD_GENERAL, "INCOMING CIRC %u STREAM %d COMMAND %s(%d) length %d",
  circ->n_circ_id, rh.stream_id, relay_command_to_string(rh.command), rh.command, rh.length);

Finally, if you want to log resolved DNS, open src/or/addressmap.c, find client_dns_set_addressmap_impl around line 665 and add:

log_notice(LD_GENERAL, "DNSRESOLVED %s ip %s ttl %d", address, name, ttl);

Run make, then copy src/or/tor to Browser/TorBrowser/tor. Download the latest release of dumb-init. Copy the following into a new file named Dockerfile (based on work by Jess Frazelle):

FROM debian:jessie
MAINTAINER Tobias Pulls <tobias.pulls@kau.se>

RUN apt-get update && apt-get install -y \
	xvfb \
	libpcap-dev \
	libasound2 \
	libdbus-glib-1-2 \
	libgtk2.0-0 \
	libxrender1 \
	libxt6 \
	xz-utils \
	xauth \
	psmisc \
	--no-install-recommends

COPY dumb-init*_amd64.deb /
RUN dpkg -i dumb-init*.deb
RUN rm dumb-init*.deb && apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

ENV HOME /home/user
ENV LANG C.UTF-8

# create user (start-tor-browser.sh prevents us from running as root)
RUN useradd --create-home --home-dir $HOME user

COPY tbw $HOME/
COPY tor-browser_en-US $HOME/tor-browser_en-US

RUN chown -R user:user $HOME \
	&& chmod +x $HOME/tbw \
	&& setcap 'CAP_NET_RAW+eip CAP_NET_ADMIN+eip' $HOME/tbw

WORKDIR $HOME
USER user
ENTRYPOINT ["dumb-init", "--"]

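Place the dumb-init .deb package, the tbw binary, and the extracted tor-browser_en-US folder next to the Dockerfile, then build the Docker image and start a worker: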

docker build -t pulls/worker .
docker run --privileged -d pulls/worker ./tbw <IP:port>

Finally, to extract the data from the resulting .torlog files, use the torlogext and fext tools:

torlogext -o results/ data/
fext results

The torlogext tool generates cell files in the format used by Wang et al., and the fext tool extracts features for Wa-kNN.
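
To inspect a .torlog file by hand, the three log_notice formats added to tor earlier can be matched with regular expressions. The sketch below is our own illustration in Python and is not how torlogext is implemented; the function name parse_line is hypothetical, and the timestamp prefix in the sample line in the comment is made up. The patterns are searched anywhere in the line, so whatever prefix tor's logger prepends is ignored.

```python
import re

# Matches "OUTGOING/INCOMING CIRC %u STREAM %d COMMAND %s(%d) length %d".
CELL_RE = re.compile(
    r"(?P<direction>OUTGOING|INCOMING) CIRC (?P<circ>\d+) STREAM (?P<stream>\d+) "
    r"COMMAND (?P<name>.+?)\((?P<code>\d+)\) length (?P<length>\d+)"
)
# Matches "DNSRESOLVED %s ip %s ttl %d".
DNS_RE = re.compile(r"DNSRESOLVED (?P<address>\S+) ip (?P<ip>\S+) ttl (?P<ttl>-?\d+)")


def parse_line(line):
    """Return a dict describing a cell or DNS log line, or None for other lines."""
    m = CELL_RE.search(line)
    if m:
        return {
            "type": "cell",
            "direction": m.group("direction"),
            "circ": int(m.group("circ")),
            "stream": int(m.group("stream")),
            "command": m.group("name"),
            "code": int(m.group("code")),
            "length": int(m.group("length")),
        }
    m = DNS_RE.search(line)
    if m:
        return {
            "type": "dns",
            "address": m.group("address"),
            "ip": m.group("ip"),
            "ttl": int(m.group("ttl")),
        }
    return None


# Hypothetical sample line (the prefix before OUTGOING is an assumption):
# parse_line("May 01 12:00:00 [notice] OUTGOING CIRC 5 STREAM 12 COMMAND DATA(2) length 498")
```

Lines that match neither pattern, such as tor's ordinary bootstrap messages, yield None and can simply be skipped.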

Wrapping up

When collecting our data, we used between 200 and 400 Docker containers (depending on hardware availability) for several days. The server needs few resources, and we ran it on a commodity laptop. The clients ran on CoreOS on 1U blades with an Intel(R) Xeon(R) CPU E5-2650 @ 2.00GHz and 62 GiB RAM.
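
As a rough sanity check on the collection time, the number of site visits implied by the two server commands can be multiplied by the 60 seconds per visit and divided across the workers. The Python sketch below is a back-of-the-envelope estimate only: it assumes every visit takes exactly 60 seconds and ignores browser start-up, failed visits, and idle workers, and the function name estimated_days is our own.

```python
SECONDS_PER_VISIT = 60   # the -t 60 value passed to the server
SECONDS_PER_DAY = 86400

monitored_visits = 9000 * 100    # top9k.csv with -s 100
unmonitored_visits = 909000 * 1  # top909k.csv with -s 1
total_visits = monitored_visits + unmonitored_visits


def estimated_days(workers):
    """Idealised wall-clock days if `workers` containers run in parallel."""
    return total_visits * SECONDS_PER_VISIT / workers / SECONDS_PER_DAY


for workers in (200, 400):
    print(f"{workers} workers: about {estimated_days(workers):.1f} days")
```

Under these assumptions, the 1,809,000 visits take roughly three to six days with 200 to 400 workers.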