For our work on the effect of DNS on Tor’s anonymity, we collected a large DNS dataset with five samples from each of the Alexa top one million most popular websites on the Internet on April 15th 2016. The data was collected with Tor Browser 5.5.4 using tools from the DefecTor toolset.
Download the data
We make the following files available:
- PCAPs: alexa1mx5.tar.gz (7.4 GiB)
SHA-256: 100b2081ca194571206ba02d88459982baf7b0584b3dd3246c0c0413048ddb5e
- Extracted text files: alexa1mx5-extracted.tar.gz (590 MiB)
SHA-256: 7361a816f24b34b1f8d9f26e9fa5a403622ce3b4b401a101f4b41cf1d6705ffc
- Alexa top 1,000,000 file: top-1m.csv (22 MiB)
SHA-256: 65f8d31a61164825900d50296de35bfbeaac405c9227abf5680ff61c404aa933
- IPv4 addresses for Cloudflare: ips-v4 (0.2 KiB)
SHA-256: 3a69b705b18bd630e748165183a8158220b755fa9026b7db967cd9769410e606
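After downloading, it is worth checking each file against its published SHA-256 digest. The following is a minimal sketch of such a check, assuming the files sit in the current directory under the names listed above; the helper names `sha256_of` and `verify` are ours and are not part of the DefecTor toolset.

```python
import hashlib
from pathlib import Path

# File names and digests copied from the download list above.
EXPECTED = {
    "alexa1mx5.tar.gz":
        "100b2081ca194571206ba02d88459982baf7b0584b3dd3246c0c0413048ddb5e",
    "alexa1mx5-extracted.tar.gz":
        "7361a816f24b34b1f8d9f26e9fa5a403622ce3b4b401a101f4b41cf1d6705ffc",
    "top-1m.csv":
        "65f8d31a61164825900d50296de35bfbeaac405c9227abf5680ff61c404aa933",
    "ips-v4":
        "3a69b705b18bd630e748165183a8158220b755fa9026b7db967cd9769410e606",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so the 7.4 GiB archive never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory=".", expected=EXPECTED):
    """Return a dict mapping each file name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif sha256_of(path) == want:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    for name, status in verify().items():
        print(f"{status:8} {name}")
```

Any file reported as a mismatch should be downloaded again before use.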
How the data was collected
Our collection method uses a fresh copy of Tor Browser for each site visit without using tor. In other words, we configured Tor Browser to not use the Tor network but instead connect directly from our university network. We did this to avoid issues like Cloudflare CAPTCHAs and IP blacklists containing exits from the Tor network. Please note that the data was collected with Tor Browser 5.5.4; newer versions might require further modifications to, e.g., prevent unwanted network traffic or even to run Tor Browser in a container.
First, install the relevant tools using Go:
Download an Alexa file with top sites and run:
The server will instruct workers to collect in total five samples of the sites in top-1m.csv, using up to 30 seconds per site visit, and store the results in the data folder with the suffix .pcap.
By default, the server listens on port 55555 on all interfaces.
Download a fresh copy of Tor Browser and extract it. Open Browser/TorBrowser/Data/Browser/profile.default/preferences/ and put the following at the bottom of extension-overrides.js:
Launch Tor Browser and follow this guide from Mozilla.
Next, download the latest release of dumb-init. We need a minimal init system to clean up the many processes we will be creating in Docker.
Copy the following into a new file named Dockerfile (based on work by Jess Frazelle):
Build the docker container and start a worker:
Finally, to extract the DNS data from the resulting pcaps, use the extractdns tool:
Where data is the folder the server stored the data in and results is the folder to store the extracted data in.
Wrapping up
The data collection for our DNS data largely mirrors how we created the DefecTor WF dataset.
Note that when we collected this dataset, we ran the server in five rounds, starting from one sample and increasing the number of samples by one for each run. This way, we made sure that our visits to the same site were spread out over time.
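To illustrate why running in rounds spreads out repeat visits, here is a small sketch of the resulting visit order. It is a simplified model under our own assumptions: visits are issued sequentially (the real server hands work to several workers in parallel), a site is visited in a round only if it has fewer samples than that round's target, and the function names `rounds_schedule` and `min_gap` are hypothetical.

```python
from collections import defaultdict


def rounds_schedule(sites, total_samples):
    """Return the (round, site) visit order when the sample target grows by one per round."""
    collected = defaultdict(int)
    visits = []
    for target in range(1, total_samples + 1):
        for site in sites:
            # A site is visited only while it is below the current target,
            # so each round adds exactly one new sample per site.
            if collected[site] < target:
                visits.append((target, site))
                collected[site] += 1
    return visits


def min_gap(visits):
    """Return the smallest distance, in visits, between two visits to the same site."""
    last_seen = {}
    gap = None
    for index, (_, site) in enumerate(visits):
        if site in last_seen:
            distance = index - last_seen[site]
            gap = distance if gap is None else min(gap, distance)
        last_seen[site] = index
    return gap


if __name__ == "__main__":
    sites = ["example.com", "example.org", "example.net"]
    schedule = rounds_schedule(sites, total_samples=5)
    for round_no, site in schedule:
        print(round_no, site)
    print("minimum gap between repeat visits:", min_gap(schedule))
```

In this model, two visits to the same site are always separated by a full pass over the site list, so with one million sites the five samples of any one site are taken far apart in time.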