The basket case: evaluation tools - The HOT research project

After gathering data in our previous post, we ended up with a folder of collected pcaps organized as data/<method>/. Each subfolder holds a dataset of 100 monitored sites with 100 instances each and 10,000 unmonitored sites. Next, we’ll use some processing and analysis tools to dig deeper. Get the tools with:

go get github.com/pylls/thebasketcase

Processing

First we need to extract cells from the pcaps. Cells are descriptions of pcaps in the following format:

For each line, the first value is the relative time between packets and the second value is the direction (positive for transmitting, negative for receiving). This is the same format as used by other website fingerprinting (WF) tools, like those of Wang et al. Note that, although the step is called cell extraction, our basket case tools operate on the packet level. Past research shows that the difference between cells and packets has a negligible effect on WF performance [0]. Also, if anything, operating on the packet level is more realistic.
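As a rough illustration of the format described above, here is a minimal Go sketch that reads such lines. It is not the parser used by the basket case tools: the Cell type, the parseCells function, the whitespace separator and the sample values are all assumptions made for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// Cell is one line of a cell file (hypothetical type for this sketch).
type Cell struct {
	Time      float64 // time value as stored in the file
	Direction int     // +1 transmitted, -1 received
}

// parseCells reads lines of the form "<time> <direction>".
// Assumption: the two values are separated by whitespace (space or tab).
func parseCells(data string) ([]Cell, error) {
	var cells []Cell
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) != 2 {
			return nil, fmt.Errorf("expected 2 fields, got %d", len(fields))
		}
		t, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return nil, err
		}
		d, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return nil, err
		}
		// Only the sign of the second value matters for the direction.
		dir := 1
		if d < 0 {
			dir = -1
		}
		cells = append(cells, Cell{Time: t, Direction: dir})
	}
	return cells, sc.Err()
}

func main() {
	// Made-up sample values, for illustration only.
	cells, err := parseCells("0.0\t1\n0.12\t-1\n0.15\t-1\n")
	if err != nil {
		panic(err)
	}
	sent, received := 0, 0
	for _, c := range cells {
		if c.Direction > 0 {
			sent++
		} else {
			received++
		}
	}
	fmt.Printf("sent=%d received=%d\n", sent, received)
}
```

Storing only the sign keeps the sketch tolerant of files that write the direction as 1, 1.0 or a signed size.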

To extract the cells from our data, use the extractcells tool. The -o flag specifies the output directory, and the -bridge flag the IP address of the bridge the client connected to. With the bridge’s IP address we can filter out unrelated traffic that might accidentally have ended up in the pcaps (even though we do our best not to collect it on the clients). The first argument is the folder with the pcaps.

./extractcells -bridge 192.168.60.163 -o data-cells data/

With cells extracted, the next step is attack-specific. In our case, we continue with Wa-kNN and use our extractfeatures tool, whose -o flag specifies the output directory.

./extractfeatures -o data-feat data-cells/

Analysis

With the Wa-kNN features extracted we can move on to analysis using go-knn. There are three mandatory flags: -sites (the number of monitored sites), -instances (the number of instances per monitored site) and -open (the number of unmonitored sites in the open world). The first argument is also mandatory: it is the folder with the extracted features.

./go-knn -sites 100 -instances 100 -open 10000 data-feat/

2016/11/24 15:03:31 found 6 folder(s) with work
2016/11/24 15:03:31 starting with work Null
2016/11/24 15:03:31 	attempting to read WF features...
2016/11/24 15:03:34 	read 100 sites with 100 instances (in total 10000)
2016/11/24 15:03:34 	read 10000 sites for open world
2016/11/24 15:05:54 	determined global kNN-weights for all folds
2016/11/24 15:05:54 	starting fold 1/10
				testing 2000/2000 (24 workers)
2016/11/24 15:06:02 	starting fold 2/10
				testing 2000/2000 (24 workers)
2016/11/24 15:06:09 	starting fold 3/10
				testing 2000/2000 (24 workers)
2016/11/24 15:06:16 	starting fold 4/10
				testing 2000/2000 (24 workers)
2016/11/24 15:06:23 	starting fold 5/10
				testing 2000/2000 (24 workers)
2016/11/24 15:06:31 	starting fold 6/10
				testing 2000/2000 (24 workers)
2016/11/24 15:06:38 	starting fold 7/10
				testing 2000/2000 (24 workers)
2016/11/24 15:06:45 	starting fold 8/10
				testing 2000/2000 (24 workers)
2016/11/24 15:06:52 	starting fold 9/10
				testing 2000/2000 (24 workers)
2016/11/24 15:07:00 	starting fold 10/10
				testing 2000/2000 (24 workers)
2016/11/24 15:07:07 starting with work Obfs4Burst
2016/11/24 15:07:07 	attempting to read WF features...
2016/11/24 15:07:10 	read 100 sites with 100 instances (in total 10000)
2016/11/24 15:07:10 	read 10000 sites for open world
.....

Above you also see some example output to stdout. go-knn calculates weights in parallel, reports progress to stdout, and uses all CPU cores for testing (which usually takes the longest on bigger datasets). Beyond printing the final results to stdout (not shown above), four files are written to the working directory:

100x100+10000-precision.csv: a CSV file of the precision of all subfolders (methods) for different k-values in Wa-kNN.
100x100+10000-recall.csv: a CSV file of the recall of all subfolders (methods) for different k-values in Wa-kNN.
100x100+10000.log: a complete output log (transcript) of the analysis.
100x100+10000.weights: the calculated weights with WLLCC of Wa-kNN for all subfolders and folds.

The names of the files are derived from the flags to go-knn. The log file also contains a detailed breakdown of the classification outcomes we covered in an earlier update.
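The naming scheme can be sketched in a few lines of Go. The pattern below is inferred from the four file names listed above rather than taken from the go-knn source, and outputPrefix is a hypothetical helper.

```go
package main

import "fmt"

// outputPrefix builds the common file-name prefix from the three flag values,
// following the pattern <sites>x<instances>+<open> seen in the listed files.
// Assumption: this mirrors, but is not copied from, what go-knn does.
func outputPrefix(sites, instances, open int) string {
	return fmt.Sprintf("%dx%d+%d", sites, instances, open)
}

func main() {
	prefix := outputPrefix(100, 100, 10000)
	for _, suffix := range []string{"-precision.csv", "-recall.csv", ".log", ".weights"} {
		fmt.Println(prefix + suffix)
	}
}
```

Because the names depend only on the flag values, runs with different dataset sizes started from the same working directory would not overwrite each other under this scheme.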

References

[0] Website Fingerprinting at Internet Scale.