For our work on the effect of DNS on Tor’s anonymity, we collected a large website fingerprinting dataset: 100 samples each of the Alexa top 9,000 sites (monitored) and one sample each of the Alexa top 909,000 sites (unmonitored). The data was collected with Tor Browser 5.5.4 using tools from the DefecTor toolset.
Download the data
We make the following files available:
- Raw logfiles: alexa9kx100+900k.tar.gz (15 GiB)
  SHA-256: c137074752143f893dba8857b0be1544ba12a6c08d4b296e7f63089e365fcf19
- Extracted cells, features and DNS requests: alexa9kx100+900k-dns+cells+feat.tar.gz (4.1 GiB)
  SHA-256: 2719475968afda4f36694fe9f84f9c1b1915db9ca440cf05b9a8361be55b8b05
- Extracted features: alexa9kx100+900k-feat.tar.gz (817 MiB)
  SHA-256: 4cfb258d4d1b12698cfa4aa56114692c646ee59dc7dbb3eecdde988336c16970
- Extracted features used in our paper: alexa1kx100+100k-feat.tar.gz (94 MiB)
  SHA-256: b7be02065cf20537683697cd083b26c2f299bb4ae5e089a58a2ba823132e8358
- Alexa top 1,000,000 file: top-1m.csv (22 MiB)
  SHA-256: 65f8d31a61164825900d50296de35bfbeaac405c9227abf5680ff61c404aa933
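Since the archives are large, it is worth checking each download against its published SHA-256 digest before extracting it. The following is a minimal sketch; the helper name verify_sha256 is ours and it assumes sha256sum from GNU coreutils is available.

```shell
# verify_sha256 FILE EXPECTED_HEX
# Succeeds if the SHA-256 digest of FILE equals EXPECTED_HEX.
# Hypothetical helper; assumes GNU coreutils' sha256sum is on PATH.
verify_sha256() {
    file=$1
    expected=$2
    # sha256sum prints "<digest>  <file>"; keep only the digest.
    actual=$(sha256sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}

# Example call, using the digest listed for top-1m.csv:
# verify_sha256 top-1m.csv 65f8d31a61164825900d50296de35bfbeaac405c9227abf5680ff61c404aa933
```

A mismatch most likely means a truncated or corrupted download, so fetch the file again rather than extracting it.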
How the data was collected
Our collection method ensures a fresh circuit and Tor Browser for each sample while caching the consensus to reduce the load on the network. Please note that the data was collected with Tor Browser 5.5.4; newer versions might require further modifications, e.g., to prevent unwanted network traffic or even to run Tor Browser in a container.
First, install the relevant tools using Go:
go get github.com/pylls/defector/cmd/{server,tbw}
go get github.com/pylls/defector/cmd/{torlogext,fext}
Download an Alexa file with top sites and run:
head -n 9000 top-1m.csv > top9k.csv
head -n 909000 top-1m.csv > top909k.csv
This creates two files for our monitored and unmonitored data. For collecting the monitored data, run:
server -f data -s 100 -t 60 -o .torlog top9k.csv
For collecting the unmonitored data, run:
server -f data -s 1 -t 60 -o .torlog top909k.csv
The server will instruct workers to collect -s samples of the sites in the specified files such as top9k.csv, using 60 seconds per site visit, and store the results in the data folder with the suffix .torlog.
By default, the server listens on port 55555 on all interfaces.
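Before starting a collection run that lasts days, it may be worth confirming that the two site lists have the expected sizes (9,000 and 909,000 lines) and shape. The sketch below uses a hypothetical helper, check_list, and assumes the Alexa file uses one "rank,domain" pair per line.

```shell
# check_list FILE EXPECTED_LINES
# Hypothetical sanity check: verifies the line count and that every
# line looks like "<rank>,<domain>" (assumed Alexa CSV layout).
check_list() {
    file=$1
    expected=$2
    lines=$(wc -l < "$file" | tr -d ' ')
    if [ "$lines" -ne "$expected" ]; then
        echo "$file: expected $expected lines, found $lines" >&2
        return 1
    fi
    # Count lines that do NOT match the expected layout.
    bad=$(grep -c -v -E '^[0-9]+,[^,]+$' "$file" || true)
    if [ "$bad" -ne 0 ]; then
        echo "$file: $bad malformed lines" >&2
        return 1
    fi
    echo "$file: $lines sites"
}

# check_list top9k.csv 9000
# check_list top909k.csv 909000
```

Note that, as created with head, top909k.csv also contains the 9,000 sites in top9k.csv.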
Download a fresh copy of Tor Browser and extract it. Open Browser/TorBrowser/Data/Browser/profile.default/preferences/ and put the following at the bottom of extension-overrides.js:
user_pref("app.update.enabled", false);
user_pref("extensions.torlauncher.prompt_at_startup", false);
user_pref("datareporting.healthreport.nextDataSubmissionTime", "1759373924100");
user_pref("datareporting.policy.firstRunTime", "1759287524100");
user_pref("extensions.torbutton.lastUpdateCheck", "1759287542.7");
user_pref("extensions.torbutton.show_slider_notification", false);
user_pref("extensions.torbutton.updateNeeded", false);
user_pref("extensions.torbutton.versioncheck_url", "");
user_pref("extensions.torbutton.versioncheck_enabled", false);
Open Browser/TorBrowser/Data/Tor/torrc and add:
LogTimeGranularity 1
UseEntryGuards 0
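If you prepare Tor Browser more than once, the configuration edits can be scripted. The sketch below uses a hypothetical helper, append_once, which adds a line to a file only if that exact line is not already there, so re-running the script does not duplicate settings. It assumes the target file ends with a newline.

```shell
# append_once FILE LINE
# Hypothetical helper: appends LINE to FILE unless an identical
# line is already present.
append_once() {
    file=$1
    line=$2
    # -x: match the whole line, -F: fixed string, -q: no output.
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Paths are relative to the extracted Tor Browser directory:
# append_once Browser/TorBrowser/Data/Tor/torrc 'LogTimeGranularity 1'
# append_once Browser/TorBrowser/Data/Tor/torrc 'UseEntryGuards 0'
```

The same helper works for the user_pref lines in extension-overrides.js.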
Launch Tor Browser and follow this guide from Mozilla. Next, we need to build a custom tor binary for Tor Browser that logs all incoming and outgoing cells using Tor’s logging framework. First, get the tor source code:
git clone https://git.torproject.org/tor.git
Follow the instructions in the INSTALL file to build tor. Once you can build tor, open src/or/relay.c, find relay_send_command_from_edge_ around line 580 and add:
log_notice(LD_GENERAL, "OUTGOING CIRC %u STREAM %d COMMAND %s(%d) length %zu",
circ->n_circ_id, stream_id, relay_command_to_string(relay_command), relay_command, payload_len);
Next, still in src/or/relay.c, find connection_edge_process_relay_cell around line 1433 and add:
log_notice(LD_GENERAL, "INCOMING CIRC %u STREAM %d COMMAND %s(%d) length %d",
circ->n_circ_id, rh.stream_id, relay_command_to_string(rh.command), rh.command, rh.length);
Finally, if you want to log resolved DNS, open src/or/addressmap.c, find client_dns_set_addressmap_impl around line 665 and add:
log_notice(LD_GENERAL, "DNSRESOLVED %s ip %s ttl %d", address, name, ttl);
Run make. Copy src/or/tor to Browser/TorBrowser/tor.
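To check that the patched binary works, you can look for the three new message types in tor's log output. The sketch below matches only the literal prefixes from the format strings above (INCOMING CIRC, OUTGOING CIRC, DNSRESOLVED). The helper names are ours, and it assumes the log is available as a plain text file with one message per line.

```shell
# count_cells LOGFILE
# Prints "<incoming> <outgoing>" relay cell counts for one log file.
count_cells() {
    awk '
        /INCOMING CIRC/ { in_cells++ }
        /OUTGOING CIRC/ { out_cells++ }
        END { printf "%d %d\n", in_cells, out_cells }
    ' "$1"
}

# resolved_names LOGFILE
# Prints "<hostname> <ip>" for every DNSRESOLVED message, relying on
# the field order "DNSRESOLVED <address> ip <name> ttl <ttl>".
resolved_names() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i == "DNSRESOLVED") print $(i + 1), $(i + 3)
    }' "$1"
}
```

If count_cells prints "0 0" after a page load, the custom binary is probably not the one Tor Browser is running.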
Download the latest release of dumb-init. Copy the following into a new file named Dockerfile (based on work by Jess Frazelle):
FROM debian:jessie
MAINTAINER Tobias Pulls <tobias.pulls@kau.se>
RUN apt-get update && apt-get install -y \
xvfb \
libpcap-dev \
libasound2 \
libdbus-glib-1-2 \
libgtk2.0-0 \
libxrender1 \
libxt6 \
xz-utils \
xauth \
psmisc \
--no-install-recommends
COPY dumb-init*_amd64.deb /
RUN dpkg -i dumb-init*.deb
RUN rm dumb-init*.deb && apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
ENV HOME /home/user
ENV LANG C.UTF-8
# create user (start-tor-browser.sh prevents us from running as root)
RUN useradd --create-home --home-dir $HOME user
COPY tbw $HOME/
COPY tor-browser_en-US $HOME/tor-browser_en-US
RUN chown -R user:user $HOME \
&& chmod +x $HOME/tbw \
&& setcap 'CAP_NET_RAW+eip CAP_NET_ADMIN+eip' $HOME/tbw
WORKDIR $HOME
USER user
ENTRYPOINT ["dumb-init", "--"]
With the dumb-init package, the tbw binary and the tor-browser_en-US directory next to the Dockerfile, build the Docker image and start a worker:
docker build -t pulls/worker .
docker run --privileged -d pulls/worker ./tbw <IP:port>
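To run many workers on one host, the docker run command above can be wrapped in a loop. This is a minimal sketch: start_workers and the DOCKER override are our own additions, and 192.0.2.10 is a placeholder documentation address for the server.

```shell
# start_workers COUNT SERVER_ADDR
# Hypothetical helper: starts COUNT detached worker containers that
# connect to the server at SERVER_ADDR (IP:port).
# Set DOCKER=echo for a dry run that only prints the arguments.
start_workers() {
    count=$1
    server=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        "${DOCKER:-docker}" run --privileged -d pulls/worker ./tbw "$server"
        i=$((i + 1))
    done
}

# Dry run for three workers against a placeholder address:
# DOCKER=echo start_workers 3 192.0.2.10:55555
```

How many containers a host can sustain depends on its memory and CPU, so it is sensible to start small and increase the count gradually.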
Finally, to extract the data from the resulting torlog-files use the torlogext and fext tools:
torlogext -o results/ data/
fext results
The torlogext tool generates cell files in the format used by Wang et al., and the fext tool extracts features for Wa-kNN.
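A quick way to inspect an extracted trace is to summarize one cell file. The sketch below assumes the layout we understand Wang et al.'s format to use: one cell per line, a timestamp in seconds followed by whitespace and a direction, positive for outgoing and negative for incoming. Check a file produced by torlogext before relying on this; summarize_cells is a hypothetical helper.

```shell
# summarize_cells CELLFILE
# Prints "<total> <outgoing> <incoming> <duration_seconds>".
# Assumed line layout: "<timestamp> <direction>", with direction > 0
# for outgoing cells and < 0 for incoming cells.
summarize_cells() {
    awk '
        NF >= 2 {
            total++
            if ($2 > 0) out_cells++; else in_cells++
            if (total == 1) first = $1
            last = $1
        }
        END { printf "%d %d %d %.3f\n", total, out_cells, in_cells, last - first }
    ' "$1"
}
```

Traces with very few cells usually indicate a failed or timed-out page load and may be worth filtering out before feature extraction.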
Wrapping up
When collecting our data we used between 200 and 400 Docker containers (depending on hardware availability) for several days. The server is not resource constrained and we ran it on a commodity laptop. The clients were run using CoreOS on 1U blades with an Intel(R) Xeon(R) CPU E5-2650 @ 2.00GHz and 62 GiB RAM.
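For planning a similar collection, a rough upper bound on wall-clock time follows from the number of visits, the per-visit time and the number of workers. The sketch below assumes every visit takes the full 60 seconds and ignores container start-up, failed visits and scheduling overhead, so treat the result as an estimate only. The figure of 300 workers is simply the midpoint of the range above.

```shell
# estimate_hours VISITS SECONDS_PER_VISIT WORKERS
# Hypothetical helper: hours needed if all workers are always busy.
estimate_hours() {
    awk -v v="$1" -v s="$2" -v w="$3" \
        'BEGIN { printf "%.1f\n", v * s / w / 3600 }'
}

# Monitored: 9,000 sites x 100 samples; unmonitored: 909,000 sites x 1.
visits=$((9000 * 100 + 909000))
estimate_hours "$visits" 60 300   # prints 100.5
```

About 100 hours is a little over four days, which is in line with the "several days" reported above.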