We’ve updated our data-gathering tools so that they are tailored to collecting datasets over pluggable transports for Tor. While the steps below are specific to basket2, it should be straightforward to adapt them to any other pluggable transport supported by Tor. We primarily use Go for our tools and Docker containers for ease of deployment.

First, download our repository using Go:

go get github.com/pylls/thebasketcase

The repository contains the Go executables and Docker images we’ll need. We’re going to run a basket2 bridge, then prepare Tor Browser, launch a gatherserver that collects our dataset, and finally run several gatherclients that each execute Tor Browser, connect to our basket2 bridge, visit sites provided by the gatherserver, and then send the results back. Let’s get started!

Run a basket2 bridge

The basket2 bridge needs to run on a machine reachable by the gatherclients we’ll run later. Get and build our basket2 fork:

go get github.com/pylls/basket2proxy
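The steps that follow copy binaries built by go get into Docker build folders, so it can help to confirm where go get placed them first. The following is a minimal sketch, assuming a single-entry GOPATH (falling back to $HOME/go when the variable is unset). The helper names gobin_path and stage_binary are hypothetical and not part of the repository:

```shell
# Print the path of a binary installed by `go get`, or fail if it is missing.
# Falls back to $HOME/go, the default GOPATH when the variable is unset.
gobin_path() {
  bin="${GOPATH:-$HOME/go}/bin/$1"
  if [ -x "$bin" ]; then
    printf '%s\n' "$bin"
  else
    echo "not found: $bin" >&2
    return 1
  fi
}

# Copy an installed binary (first argument) into a build folder (second argument).
stage_binary() {
  src=$(gobin_path "$1") || return 1
  cp "$src" "$2/"
}
```

For example, running stage_binary basket2proxy . from inside the Docker image folder would place the executable next to the Dockerfile.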

In a terminal, navigate to the basket2 Docker image folder at thebasketcase/docker/basket2 and copy the basket2proxy executable there (it was created by go get github.com/pylls/basket2proxy and placed in your $GOPATH/bin folder). Next, build the Docker image:

docker build -t pulls/basket2bridge .

Run a container with the bridge, spawning an interactive shell:

docker run -p <IP>:<Port>:11111 -i -t pulls/basket2bridge bash

In the container, start Tor:

root@d07b96f4057c:/# service tor start
[ ok ] Starting tor daemon...done.
root@d07b96f4057c:/# cd /var/lib/tor/
root@85252b2e7a49:/var/lib/tor# cat fingerprint
Unnamed 5DD80B4AC2F718F1D8CACDAD1FD88644950A52B6
root@85252b2e7a49:/var/lib/tor# cat pt_state/basket2_bridgeline.txt
# basket2 torrc client bridge line
#
# This file is an automatically generated bridge line based on
# the current basket2proxy configuration.  EDITING IT WILL HAVE
# NO EFFECT.
#
# Before distributing this Bridge, edit the placeholder fields
# to contain the actual values:
#  <IP ADDRESS>  - The public IP address of your obfs4 bridge.
#  <PORT>        - The TCP/IP port of your obfs4 bridge.
#  <FINGERPRINT> - The bridge's fingerprint.

Bridge basket2 <IP ADDRESS>:<PORT> <FINGERPRINT> basket2params=0:0001:QiNZ5eqnrzPOXv4NyQ3Og5UntIpClPX6GC4c4Cq/I0Y

Assuming the server IP and port above were 192.168.60.184:11111, and the fingerprint and basket2_bridgeline.txt contents are as shown above, the following is the bridgeline we’ll use later to connect to the bridge:

Bridge basket2 192.168.60.184:11111 5DD80B4AC2F718F1D8CACDAD1FD88644950A52B6 basket2params=0:0001:QiNZ5eqnrzPOXv4NyQ3Og5UntIpClPX6GC4c4Cq/I0Y
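Filling in the placeholders by hand is easy to get wrong, so here is a small sketch of how the substitution could be scripted. It assumes the two files have the format shown above (a fingerprint file containing a nickname followed by the fingerprint, and a bridge line file with a single line starting with "Bridge"). The function name make_bridgeline is hypothetical:

```shell
# Fill in the placeholders of a generated basket2 bridge line.
# Arguments: <ip> <port> <fingerprint file> <bridge line file>
make_bridgeline() {
  ip=$1
  port=$2
  # The fingerprint file holds "<nickname> <fingerprint>"; keep the second field.
  fp=$(awk '{print $2; exit}' "$3")
  # Skip comment and blank lines, then substitute the three placeholders.
  grep '^Bridge ' "$4" | sed \
    -e "s/<IP ADDRESS>/$ip/" \
    -e "s/<PORT>/$port/" \
    -e "s/<FINGERPRINT>/$fp/"
}
```

Inside the container, this could be called as make_bridgeline 192.168.60.184 11111 /var/lib/tor/fingerprint /var/lib/tor/pt_state/basket2_bridgeline.txt, where the IP and port are the externally reachable ones passed to docker run.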

Prepare Tor Browser

This is by far the most annoying step. Follow our guide on how to configure Tor Browser to generate less network noise. Note that versions after Tor Browser 6.0.6 might require more work. Use the basket2 bridgeline from above; if you get stuck on that part, there is an example torrc for basket2 further below. Next, create a modified tor binary as described in our DefecTor experiments. We used a Debian Jessie VM as the build machine for the modified tor binary. Remember to copy src/or/tor to Browser/TorBrowser/tor. After making all the changes and replacing the tor binary, check that Tor Browser still works.

Run gatherserver

The gatherserver binary should already be available in your $GOPATH/bin folder. Note that all captured traffic traces are stored on the machine running gatherserver, so make sure it has ample disk space. Copy the gatherserver binary to a working directory. Create a torrc file with the following contents:

LogTimeGranularity 1
UseBridges 1
Bridge basket2 192.168.60.184:11111 5DD80B4AC2F718F1D8CACDAD1FD88644950A52B6 basket2params=0:0001:QiNZ5eqnrzPOXv4NyQ3Og5UntIpClPX6GC4c4Cq/I0Y
ClientTransportPlugin basket2 exec ./TorBrowser/Tor/PluggableTransports/basket2proxy -enableLogging=true -logLevel DEBUG -paddingMethods {{.Method}}

Replace the third line with the bridgeline from your basket2 server. Download a fresh Alexa top-1m file from Amazon and unzip it. Run:

./gatherserver -h

This lists all options. For example, run:

./gatherserver -monitored 100 -samples 100 -unmonitored 10000 top-1m.csv

The server will distribute work to all connecting gatherclients (which we run next). That work results in 100 monitored sites with 100 samples each, taken from Alexa ranks [1, 100], and 10,000 randomly selected unmonitored sites from Alexa ranks (100, 1000000]. Data is collected for all default basket2 methods and stored in data/<method>/, with one sub-folder per method. The server can be freely restarted and will resume based on the work that is already on disk.
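To estimate the required disk space and run time, it helps to work out how many samples a given set of flags produces. The sketch below does that arithmetic, assuming each unmonitored site is sampled once and taking the number of methods as an argument. The function names are hypothetical, and the six methods in the example match the six subfolders mentioned in the summary:

```shell
# Samples gathered per basket2 method: every monitored site is sampled
# <samples> times, every unmonitored site once.
# Arguments: <monitored> <samples> <unmonitored>
samples_per_method() {
  echo $(( $1 * $2 + $3 ))
}

# Total across all methods.
# Arguments: <monitored> <samples> <unmonitored> <number of methods>
total_samples() {
  echo $(( $(samples_per_method "$1" "$2" "$3") * $4 ))
}

samples_per_method 100 100 10000   # prints 20000
total_samples 100 100 10000 6      # prints 120000
```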

Run gatherclients

In a terminal, navigate to the gatherclient Docker image folder at thebasketcase/docker/gatherclient. Copy the Tor Browser directory that you modified earlier there, and the gatherclient binary from $GOPATH/bin. Next, build the Docker image:

docker build -t pulls/gatherclient .

If you’re lazy (like me), then you can use the run.sh and clean.sh scripts to run and clean up docker containers with gatherclients in them. In run.sh:

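#!/bin/bash
# Start $1 gatherclient containers in the background.
# The C-style for loop requires bash, not a plain POSIX sh.
for ((n=0; n<$1; n++)); do
  docker run --privileged -d pulls/gatherclient ./gatherclient <IP>:55555
done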

Replace <IP> with the IP address of your gatherserver (which listens on port 55555). Then, to run 10 clients, just:

./run.sh 10

If you look at the terminal output of the gatherserver, you should see the worker count increasing and eventually (after a warmup browse) data being gathered. On failure, gatherclient writes errors (including the stdout and stderr output it forwards from tor) to stdout. You can inspect them with docker ps, docker logs, docker stats, etc. By far the most annoying errors for us to debug have been related to Tor Browser and tor not starting properly. By default, each gatherclient attempts to browse to a site five times before giving up, which delays errors showing up in the Docker log.
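With many containers running, checking logs one container at a time gets tedious. Here is a rough sketch of two helpers, assuming the containers were started from the pulls/gatherclient image built above. The function names are hypothetical, and this is not the clean.sh script from the repository:

```shell
IMAGE=pulls/gatherclient

# Print the last lines of output from every running gatherclient container.
# Optional argument: number of lines per container (default 20).
gatherclient_logs() {
  lines=${1:-20}
  docker ps -q --filter "ancestor=$IMAGE" | while read -r id; do
    echo "== $id =="
    docker logs --tail "$lines" "$id" 2>&1 </dev/null
  done
}

# Force-remove every gatherclient container, running or stopped.
gatherclient_clean() {
  docker ps -a -q --filter "ancestor=$IMAGE" | while read -r id; do
    docker rm -f "$id" </dev/null
  done
}
```

Calling gatherclient_logs 50 would show the last 50 lines per container, which is usually enough to spot a tor instance that failed to start.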

Summary

Once the gatherserver is done, we have 120,000 samples in total with 20,000 samples per method in six subfolders located at data/<method>/. The next post will cover how we use our analysis tools on the data.