We’ve updated our data gathering tools so that they are tailored for collecting datasets over pluggable transports for Tor. While the steps below are specific to basket2, it should be straightforward to adapt them to any other pluggable transport supported by Tor. We primarily use Go for our tools and Docker containers for ease of deployment.
First, download our repository using Go:
The repository contains the Go executables and Docker images we’ll need. We’re going to run a basket2 bridge, then prepare Tor Browser, launch a gatherserver that collects our dataset, and finally run several gatherclients that each execute Tor Browser, connect to our basket2 bridge, visit sites provided by the gatherserver, and then send the results back. Let’s get started!
Run a basket2 bridge
The basket2 bridge needs to run on a machine reachable by the gatherclients we’ll run later. Get and build our basket2 fork:
In a terminal, navigate to the basket2 Docker image folder at thebasketcase/docker/basket2 and copy the basket2proxy executable there (created by go get github.com/pylls/basket2proxy and placed in your $GOPATH/bin folder). Next, build the Docker image:
Run a container with the bridge, spawning an interactive shell:
In the container, start Tor:
Assuming the server IP and port above were 192.168.60.184:11111, with the fingerprint and basket2_bridgeline as shown above, the following is the bridgeline we’ll use later to connect to the bridge:
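To make the structure of a bridgeline explicit, here is a minimal Go sketch that assembles one from its parts. It assumes the usual torrc layout of a Bridge line (transport name, address, fingerprint, then transport-specific arguments). The function name bridgeLine, the transport name "basket2", the all-zero fingerprint and the <basket2-args> placeholder are ours for illustration; substitute the values your own bridge reported.

```go
package main

import (
	"fmt"
	"strings"
)

// bridgeLine joins the parts of a torrc Bridge line with single spaces.
// args is the transport-specific argument string reported by the bridge
// and may be empty.
func bridgeLine(transport, addr, fingerprint, args string) string {
	parts := []string{"Bridge", transport, addr, fingerprint}
	if args != "" {
		parts = append(parts, args)
	}
	return strings.Join(parts, " ")
}

func main() {
	// Placeholder fingerprint and arguments (hypothetical): use the
	// values printed by your own bridge instead.
	fmt.Println(bridgeLine(
		"basket2",
		"192.168.60.184:11111",
		"0000000000000000000000000000000000000000",
		"<basket2-args>",
	))
}
```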
Prepare Tor Browser
This is by far the most annoying step.
Follow our guide on how to configure Tor Browser to generate less network noise. Note that versions after Tor Browser 6.0.6 might require more work. Use the basket2 bridgeline from above. If you get stuck on that part, you can find an example torrc for basket2 further below.
Next, create a modified tor binary as described in our DefecTor experiments. When creating the modified tor binary we used a Debian Jessie VM as a build machine. Remember to copy src/or/tor to Browser/TorBrowser/tor. Make sure to check that Tor Browser still works after making all the changes and replacing the tor binary.
Run gatherserver
The gatherserver binary should already be available in your $GOPATH/bin folder. Note that all captured traffic traces will be stored by the gatherserver, so have ample disk space available. Copy the gatherserver binary to a working directory. Create a torrc file with the following contents:
Replace the third line with the basket2 bridgeline from your basket2 server. Download a fresh Alexa top-1m file from Amazon and unzip it. Run:
to see all options. For example, if you run:
the server will distribute work to all connecting gatherclients (which we run next) that results in 100 monitored sites with 100 samples each from Alexa top [1,100] and 10,000 randomly selected unmonitored sites from Alexa (100, 1000000]. The data will be collected for all default basket2 methods and stored in data/<method>/, with one sub-folder per method. The server can be freely restarted and will resume based on what work is already on disk.
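The split into monitored and unmonitored sites can be sketched in Go as follows. This is not the gatherserver’s actual code, only an illustration under stated assumptions: the Alexa file has one "rank,domain" entry per line in rank order, and the helper names parseAlexa and selectSites, the seed parameter and the tiny in-memory list are all hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"math/rand"
	"strings"
)

// parseAlexa reads "rank,domain" lines and returns the domains in file order.
func parseAlexa(r io.Reader) ([]string, error) {
	var domains []string
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		fields := strings.SplitN(line, ",", 2)
		if len(fields) != 2 {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		domains = append(domains, fields[1])
	}
	return domains, sc.Err()
}

// selectSites takes the top `monitored` entries as monitored sites and draws
// `unmonitored` distinct sites at random from the rest of the list.
func selectSites(domains []string, monitored, unmonitored int, seed int64) (mon, unmon []string, err error) {
	if monitored < 0 || unmonitored < 0 || monitored+unmonitored > len(domains) {
		return nil, nil, fmt.Errorf("need %d sites, have %d", monitored+unmonitored, len(domains))
	}
	mon = domains[:monitored]
	rest := domains[monitored:]
	rng := rand.New(rand.NewSource(seed))
	for _, i := range rng.Perm(len(rest))[:unmonitored] {
		unmon = append(unmon, rest[i])
	}
	return mon, unmon, nil
}

func main() {
	// Tiny in-memory stand-in for the Alexa top-1m file.
	list := "1,a.example\n2,b.example\n3,c.example\n4,d.example\n5,e.example\n"
	domains, err := parseAlexa(strings.NewReader(list))
	if err != nil {
		panic(err)
	}
	mon, unmon, err := selectSites(domains, 2, 2, 1)
	if err != nil {
		panic(err)
	}
	fmt.Println("monitored:", mon)
	fmt.Println("unmonitored:", unmon)
	// With the parameters from the post: 100 sites x 100 samples, plus
	// 10,000 unmonitored sites visited once each.
	fmt.Println("visits per method:", 100*100+10000)
}
```

With the parameters used in this post, each method requires 100 × 100 + 10,000 = 20,000 visits.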
Run gatherclients
In a terminal, navigate to the gatherclient Docker image folder at thebasketcase/docker/gatherclient. Copy the Tor Browser directory that you modified earlier there, along with the gatherclient binary from $GOPATH/bin. Next, build the Docker image:
If you’re lazy (like me), then you can use the run.sh and clean.sh scripts to run and clean up Docker containers with gatherclients in them.
In run.sh:
Replace <IP> with the IP address of your gatherserver (which listens on port 55555). Then, to run 10 clients, just:
If you look at the terminal output of the gatherserver you should see the workers count increasing and eventually (after a warmup browse) data being gathered. On failure, gatherclient will write errors (including forwarded stdout and stderr output from tor) to stdout. You can inspect the containers with docker ps, docker logs, docker stats, etc. By far the most annoying error for us to debug has been Tor Browser and tor not starting properly. By default, each gatherclient will attempt to browse to a site five times before giving up, which delays errors in the Docker log.
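The retry behaviour explains why errors show up late in the log: nothing is reported until all attempts are exhausted. A minimal Go sketch of such a loop, not taken from the gatherclient source, looks like this. The names browseWithRetry, maxAttempts and errNotStarted, and the browse callback, are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// maxAttempts mirrors the default of five attempts described in the post.
const maxAttempts = 5

// errNotStarted is a stand-in error for Tor Browser or tor failing to start.
var errNotStarted = errors.New("tor browser did not start")

// browseWithRetry calls browse until it succeeds or maxAttempts is reached.
// The error is only returned after the final failed attempt.
func browseWithRetry(site string, browse func(string) error) error {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = browse(site); lastErr == nil {
			return nil
		}
	}
	return fmt.Errorf("giving up on %s after %d attempts: %w", site, maxAttempts, lastErr)
}

func main() {
	// Simulated browse function that fails twice and then succeeds.
	calls := 0
	flaky := func(site string) error {
		calls++
		if calls < 3 {
			return errNotStarted
		}
		return nil
	}
	err := browseWithRetry("example.com", flaky)
	fmt.Println("calls:", calls, "error:", err)
}
```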
Summary
Once the gatherserver is done, we have 120,000 samples in total with 20,000 samples per method in six subfolders located at data/<method>/.
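As a sanity check on these numbers, the following Go sketch computes the expected total and counts what is on disk. It assumes that each sample is stored as one regular file directly under data/<method>/, which we have not verified against the gatherserver’s on-disk layout; the helper names expectedTotal and countSamples are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// expectedTotal returns methods * (monitored sites * samples + unmonitored sites).
func expectedTotal(methods, monitoredSites, samplesPerSite, unmonitoredSites int) int {
	return methods * (monitoredSites*samplesPerSite + unmonitoredSites)
}

// countSamples counts the regular files in each sub-folder of root.
// Assumption: one file per sample, stored directly in the method folder.
func countSamples(root string) (map[string]int, error) {
	methods, err := os.ReadDir(root)
	if err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, m := range methods {
		if !m.IsDir() {
			continue
		}
		counts[m.Name()] = 0
		entries, err := os.ReadDir(filepath.Join(root, m.Name()))
		if err != nil {
			return nil, err
		}
		for _, e := range entries {
			if !e.IsDir() {
				counts[m.Name()]++
			}
		}
	}
	return counts, nil
}

func main() {
	// Six methods, 100 monitored sites with 100 samples each, and
	// 10,000 unmonitored sites.
	fmt.Println("expected samples:", expectedTotal(6, 100, 100, 10000))

	counts, err := countSamples("data")
	if err != nil {
		fmt.Println("could not read data folder:", err)
		return
	}
	for method, n := range counts {
		fmt.Printf("%s: %d samples\n", method, n)
	}
}
```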
The next post will cover how we use our analysis tools on the data.