Computer Laboratory

Cambridge Cybercrime Centre: Description of available datasets

This page sets out, at a fairly high level, what datasets are currently available from the Cambridge Cybercrime Centre.

We'd be happy to answer questions about the detail -- where we are able to do so.

For information about the steps in the process for obtaining data from the Cybercrime Centre you should consult this page.

Reflected DDoS victims

We operate nearly 100 sensors in various locations around the world that record incoming UDP packets (in PCAP files in the first instance, but duplicates are summarised in text files).

The sensors respond to packets associated with scanning for 'reflectors', which are often used in distributed reflected amplified denial of service (DDoS) attacks. As a result our sensors are often called into play for these attacks, which means that we have a record of the victim IP address.
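In a reflected attack the attacker spoofs the victim's address as the source of each request, so the source address recorded by a sensor is the victim. The sketch below shows one way to pull likely victims out of summarised packet records. The `UdpPacket` record layout and the `min_packets` threshold are assumptions made for illustration; they are not the Centre's file format or its definition of an attack.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class UdpPacket(NamedTuple):
    """Hypothetical summary record for one UDP packet seen by a sensor."""
    timestamp: float  # seconds since the epoch
    src_ip: str       # spoofed source address, i.e. the intended victim
    dst_port: int     # reflector service being abused, e.g. 123 for NTP


def likely_victims(packets: Iterable[UdpPacket], min_packets: int = 5):
    """Return {(victim_ip, port): packet_count} for busy sources.

    min_packets is an illustrative threshold that separates attack
    traffic from one-off scan probes; pick a value to suit the research.
    """
    counts = Counter((p.src_ip, p.dst_port) for p in packets)
    return {key: n for key, n in counts.items() if n >= min_packets}


# Six requests "from" one address to the NTP port, plus a lone probe.
sample = [UdpPacket(1000.0 + i, "192.0.2.10", 123) for i in range(6)]
sample.append(UdpPacket(1001.0, "198.51.100.7", 53))
victims = likely_victims(sample)
```

Here `victims` holds a single entry for 192.0.2.10 on port 123, and the lone probe is ignored.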

Our dataset starts in March 2014, though the number of sensors varies over time. A high-level description of our collection system and a summary of the data appear in our paper: Daniel R. Thomas, Richard Clayton, and Alastair R. Beresford: "1000 days of UDP amplification DDoS attacks", APWG eCrime, 2017.

Note that the dataset is large (we have data on over 4 trillion packets) so you will need to think about whether your research should only be on a subset of the data.

Mirai scanning data

The Mirai malware scans for devices that it may be able to compromise by sending out distinctive TCP SYN packets. We collect these packets when they hit our sensors. We have data from about a dozen sensors from mid-November 2016 onwards, but significantly better coverage from a (circa) /16 after April 2017 and from a further (bit more than a) /14 from mid-October 2017 onwards.
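As an illustration of what "distinctive" can mean here: in the published Mirai source code the scanner sets the TCP sequence number of each SYN to the destination IPv4 address. The sketch below tests for that fingerprint and also shows how many addresses a /16 and a /14 cover. Whether the Centre's collection uses exactly this test is an assumption, and the function names are hypothetical.

```python
import ipaddress


def looks_like_mirai_syn(dst_ip: str, tcp_seq: int, syn: bool, ack: bool) -> bool:
    """True if a segment is a bare SYN whose sequence number equals dst_ip.

    Assumes the well-known fingerprint from the published Mirai source;
    other scanners or later variants may behave differently.
    """
    if not syn or ack:
        return False
    return tcp_seq == int(ipaddress.IPv4Address(dst_ip))


def prefix_size(prefix_len: int) -> int:
    """Number of IPv4 addresses covered by a prefix of the given length."""
    return 2 ** (32 - prefix_len)


# 192.0.2.1 as a 32-bit integer is 0xC0000201.
is_mirai = looks_like_mirai_syn("192.0.2.1", 0xC0000201, syn=True, ack=False)
```

For scale, `prefix_size(16)` is 65536 addresses and `prefix_size(14)` is 262144, which is why the later periods give much better coverage than a dozen individual sensors.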

Mirai malware (etc)

We operate honeypots that are specifically intended to collect Mirai malware (they also get a certain amount of bycatch -- copies of QBot variants etc.). We have around 15000 binaries (which is not as impressive as it sounds because a Mirai source file is usually compiled ten or more times for different CPU architectures and we collect all the variants we can).

Honeypot data

We have over 1 year of honeypot data for tcp/22 (SSH connections).

Underground forums

We "scrape" a number of publicly available underground forums where there is discussion of cybercrime and advertising of the results of cybercrime. Some of these forums have been operating for many years and we have now amassed a complete collection of posts (excluding those that have been actively deleted of course). Currently we have over 40 million posts, some dating back more than 10 years.

It is possible to use this data to determine both what has been posted about a particular cybercrime technique (and when) and also what some particular person (hidden behind a pseudonym of course) might have been posting about.

Blog spam

Our blog (https://lightbluetouchpaper.org) receives a large number (50+) of spam comments each day. We now have a collection of around 200K of these.

Although the comments are often just lists of URLs to pharma sites, they are sometimes socially engineered to try and encourage us to make them visible -- and from time to time the posters get confused and post their templates rather than the customised result.

Phishing URLs

We have a very substantial list of phishing URLs going back over 10 years. We obtain these URLs not only from the APWG, but also from other sources, so our list is probably one of the most extensive there is. That said, some of the URLs are on the list in error and this complicates experimental design. If you are considering using this dataset then we can assist by explaining its provenance in more detail.

Phishing websites

We plan to visit all phishing URLs and fetch the web pages found there, but this system is not currently in "production". If you are interested in this data then you should talk to us further.

Phishing emails

We have a dataset of phishing emails (sent to a small set of email addresses) from 2005 onwards. The dataset contains numerous duplicates so numbers are inexact, but it contains up to 10000 emails per year until 2010 and around 1000 emails a year since then.
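Because the counts depend on how duplicates are defined, anyone using the email datasets will need to choose a de-duplication rule. The sketch below is one simple possibility: hash the whitespace-normalised, lower-cased body and count distinct hashes. It is not how the figures above were produced, and the function names are hypothetical.

```python
import hashlib
import re
from email import message_from_string


def body_fingerprint(raw_email: str) -> str:
    """SHA-256 of the normalised body text, ignoring all headers."""
    msg = message_from_string(raw_email)
    if msg.is_multipart():
        # Use the first text/plain part; transfer encodings are not decoded.
        parts = [p for p in msg.walk() if p.get_content_type() == "text/plain"]
        payload = parts[0].get_payload() if parts else ""
    else:
        payload = msg.get_payload()
    normalised = re.sub(r"\s+", " ", str(payload)).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8", errors="replace")).hexdigest()


def count_unique(raw_emails) -> int:
    """Number of distinct bodies under the fingerprint above."""
    return len({body_fingerprint(e) for e in raw_emails})


emails = [
    "Subject: Verify your account\n\nPlease  log in here\n",
    "Subject: URGENT\n\nplease log in here",
    "Subject: Hello\n\nA different message\n",
]
unique = count_unique(emails)
```

In this example the first two messages differ only in headers, case and spacing, so `unique` is 2. A stricter or looser rule (for example, also stripping URLs) will give different totals.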

Advanced Fee Fraud (419 scam) emails

We have a dataset of Advanced Fee Fraud (sometimes called 419 scam) emails (sent to a small set of email addresses) from 2001 onwards. The dataset has duplicates so numbers are inexact, but contains around 75000 unique emails. For the past few years offers of loans have been included in this dataset.

Email spam

We have a dataset of "spam" email (sent to a small set of email addresses) dating from 2003 onwards (and indeed some spam email from the mid-1990s onwards as well). Numbers of emails vary dramatically from year to year (and there are large numbers of duplicates) but in recent times exceed 2000 emails a month. Note that for relevant periods phishing and Advanced Fee Fraud emails have been extracted into separate datasets.

Domain names

We have an archive of registered .com and .net domain names dating back to September 2009. We are now archiving all the "new" TLDs.

IP address allocation

We maintain a database recording IP address allocations over time (so that it is simple to check on the usage of IP addresses on past dates).
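To make the idea of a historical lookup concrete, the sketch below stores dated allocation records per prefix and answers "who held this address on this date?". The `AllocationHistory` class, its record layout and the holder names are all hypothetical; the Centre's actual database schema is not described here.

```python
import bisect
import ipaddress
from datetime import date


class AllocationHistory:
    """Hypothetical in-memory store of dated allocation records."""

    def __init__(self):
        # network -> list of (start_date, holder), kept sorted by date
        self._records = {}

    def add(self, prefix: str, start: date, holder: str) -> None:
        net = ipaddress.ip_network(prefix)
        bisect.insort(self._records.setdefault(net, []), (start, holder))

    def holder_on(self, ip: str, when: date):
        """Holder of the most specific matching prefix on the given date."""
        addr = ipaddress.ip_address(ip)
        best = None  # (prefix_length, holder)
        for net, history in self._records.items():
            if addr.version != net.version or addr not in net:
                continue
            # Records that had already started on the query date.
            current = [holder for start, holder in history if start <= when]
            if current and (best is None or net.prefixlen > best[0]):
                best = (net.prefixlen, current[-1])
        return best[1] if best else None


history = AllocationHistory()
history.add("192.0.2.0/24", date(2010, 1, 1), "Example ISP A")
history.add("192.0.2.0/24", date(2015, 6, 1), "Example ISP B")
past_holder = history.holder_on("192.0.2.55", date(2012, 3, 4))
```

Here `past_holder` is "Example ISP A", because the reallocation to "Example ISP B" had not yet happened on the query date. The linear scan over prefixes keeps the sketch short; a real store of this kind would more likely use a prefix trie or an indexed database table.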

...old datasets

We also have the relevant datasets for most of the cybercrime papers we have published over the past 10 years (on phishing, Ponzi schemes, IM worms etc.).
