Cambridge Cybercrime Centre: Description of available datasets

This page sets out, at a fairly high level, what datasets are currently available from the Cambridge Cybercrime Centre.

We'd be happy to answer questions about the detail -- where we are able to do so.

For information about the steps in the process for obtaining data from the Cybercrime Centre you should consult this page.

Underground forums

We "scrape" a number of publicly available underground forums where there is discussion of cybercrime and advertising of the results of cybercrime. Some of these forums have been operating for many years and we have now amassed a complete collection of posts (excluding those that have been actively deleted of course). Currently we have over 100 million posts, some dating back more than 10 years.

It is possible to use this data to determine both what has been posted about a particular cybercrime technique (and when) and also what some particular person (hidden behind a pseudonym of course) might have been posting about.
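Both kinds of question come down to simple filters over post records. The following is a minimal sketch, assuming a hypothetical record layout (`author`, `posted`, `text`); the actual schema of the forum dataset is not described on this page and may well differ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Post:
    # Hypothetical field names, chosen for illustration only.
    author: str   # the poster's pseudonym
    posted: date  # when the post was made
    text: str     # body of the post


def posts_mentioning(posts, keyword):
    """Return posts whose text mentions keyword (case-insensitive), oldest first."""
    needle = keyword.lower()
    hits = [p for p in posts if needle in p.text.lower()]
    return sorted(hits, key=lambda p: p.posted)


def posts_by_author(posts, pseudonym):
    """Return everything posted under one pseudonym, oldest first."""
    return sorted((p for p in posts if p.author == pseudonym),
                  key=lambda p: p.posted)


# Invented sample data, not taken from the real dataset.
sample = [
    Post("alice_x", date(2019, 5, 2), "Selling a booter subscription"),
    Post("bob_y", date(2017, 1, 9), "Anyone tried this BOOTER?"),
    Post("alice_x", date(2018, 3, 1), "New here, hello"),
]

for post in posts_mentioning(sample, "booter"):
    print(post.posted, post.author)
```

With a collection of over 100 million posts, a real analysis would run queries like these in a database or search index rather than over in-memory lists.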

Extremist forums

We are expanding our forum collection to include material from "extremist" forums. Although there are some cybercrime aspects to this material it will mainly be of interest to those who are studying hate groups, extremism and radicalisation. We will shortly have more than 40 million posts.

We have also been collecting from a range of "Incel" (INvoluntary CELibates) forums. These forums support online subcultures, where members are unable to find a romantic partner despite desiring one. Extremist thoughts and opinions are commonly found on these forums. Our dataset already holds more than 7 million posts and 700,000 threads and it is being scraped in an on-going manner.

Underground marketplaces

We do not currently scrape any underground marketplace websites, but plan to expand into this area soon. However, one of the underground forums that we do scrape has introduced a service for processing "contracts" and from this we collect a range of valuable information such as the nature of the goods and services being exchanged, maker/taker obligations, contract values, agreement term and reputation ratings of the parties involved. It may also contain payment details, including bitcoin wallets and transaction hashes. This is a ground-truth dataset, which can be used to understand part of the underground economy and its underlying social network. The dataset contains roughly 180,000 contracts at present and it is being collected on a regular basis.

Defaced websites

Some people choose to boast about their hacking ability by breaking into websites, defacing pages and then publishing details of the defaced page online. We are building a dataset of these boasts: we currently have about 550,000 sets of details (notifier, location, IP address, domain, webserver information and snapshot of defaced page) and expect this total to grow markedly over the next few months.

Underground chat channels

A number of publicly accessible channels on Discord and Telegram are used for discussions of cybercrime topics such as illicit markets and booter (DDoS) services. We currently have a collection of over 3 million Telegram messages (from 50+ channels) and 2.5 million Discord messages (from over 3000 channels).

Extremist chat channels

A number of publicly accessible channels on Discord and Telegram are used for discussions of "extremist" content relating to hate groups, extremism and radicalisation. We currently have a collection of over 2.6 million Telegram messages (from 60+ channels) and 5.3 million Discord messages (from over 600 channels).

Investment scams

Investment fraudsters and financial scam operators aim to lure victims into making investments in fake schemes, which either promise very high rates of return (with very low risk), impersonate some genuine companies or do not exist at all. We have been collecting scam reports from multiple sources including blocklists, scam reporting forums and online social media posts and we currently hold details of more than 150,000 associated web URLs.

Modded apps

Third-party app marketplaces are now filled with large numbers of modded Android apps offering similar (or more) functionality compared with the original application. The diversity of these marketplaces has opened new opportunities for malicious actors: modifying the in-app ad networks, including malware, etc. We are building a dataset of these apps collected from several sources -- 3000 so far and growing.

Reflected DDoS victims

We operate nearly 100 sensors in various locations around the world that record incoming UDP packets (in PCAP files in the first instance, but duplicates are summarised in text files).

The sensors respond to packets associated with scanning for 'reflectors', which are often used in distributed reflected amplified denial of service (DDoS) attacks. As a result, our sensors are often called into play for these attacks, and so we have a record of the victim IP.
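In a reflected attack the attacker spoofs the victim's address as the source of the packets sent to reflectors, so the victim can be read off the source IP field of what a sensor receives. The following is a minimal sketch, assuming hypothetical pre-parsed records of (sensor, source IP, UDP port) and an illustrative packet threshold; it is not the Centre's own definition of an attack.

```python
from collections import Counter

# Illustrative threshold only: a handful of packets may just be scanning.
MIN_PACKETS = 5


def likely_victims(records, min_packets=MIN_PACKETS):
    """Count packets per (spoofed source IP, UDP port) and keep the busy ones.

    records is an iterable of (sensor_id, src_ip, udp_port) tuples; this
    layout is an assumption made for the example.
    """
    counts = Counter((src_ip, port) for _sensor, src_ip, port in records)
    return {key: n for key, n in counts.items() if n >= min_packets}


# Invented sample data using documentation address ranges.
records = ([("s1", "192.0.2.10", 123)] * 6
           + [("s2", "198.51.100.7", 53)] * 2)

print(likely_victims(records))
```

Grouping by port as well as address keeps attacks using different UDP protocols apart, which is useful when comparing amplification vectors.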

Our dataset starts in March 2014, though the number of sensors varies over time. A high level description of our collection system and a summary of the data appears in our paper Daniel R. Thomas, Richard Clayton, and Alastair R. Beresford: "1000 days of UDP amplification DDoS attacks", APWG eCrime, 2017.

Note that the dataset is large (we have data on over 4 trillion packets) so you will need to think about whether your research should only be on a subset of the data.

Mirai scanning data

The Mirai malware scans for devices that it may be able to compromise by sending out distinctive TCP SYN packets. We collect these packets when they hit our sensors. We have data from about a dozen sensors from mid-November 2016 onwards, but significantly better coverage from a (circa) /16 after April 2017 and from a further (bit more than a) /14 from mid-October 2017 onwards.
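For a sense of scale, a /16 covers 65,536 IPv4 addresses and a /14 covers 262,144. The sketch below computes those sizes and also checks one widely reported Mirai scanning fingerprint: a SYN whose TCP sequence number equals the destination address read as a 32-bit integer. Whether our collection relies on this particular fingerprint is an assumption made for the example, and the function names are illustrative.

```python
import ipaddress


def prefix_size(prefix_len: int) -> int:
    """Number of IPv4 addresses covered by a prefix of the given length."""
    return 2 ** (32 - prefix_len)


def looks_like_mirai_syn(dst_ip: str, tcp_seq: int) -> bool:
    """True if the TCP sequence number equals the destination IPv4 address.

    This is a commonly cited characteristic of Mirai's scanner; treat it
    as a heuristic rather than proof of infection.
    """
    return int(ipaddress.IPv4Address(dst_ip)) == tcp_seq


print(prefix_size(16), prefix_size(14))

# Example packet aimed at a documentation address.
dst = "192.0.2.1"
print(looks_like_mirai_syn(dst, int(ipaddress.IPv4Address(dst))))
```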

Mirai malware (etc)

We operate honeypots that are specifically intended to collect Mirai malware (they also get a certain amount of bycatch -- copies of QBot variants etc.). We have around 15000 binaries (which is not as impressive as it sounds because a Mirai source file is usually compiled ten or more times for different CPU architectures and we collect all the variants we can).

Honeypot data

We have over 1 year of honeypot data for tcp/22 (SSH connections).

Blog spam

Our blog (https://lightbluetouchpaper.org) receives a large number (50+) of spam comments each day. We now have a collection of around 200K of these.

Although the comments are often just lists of URLs to pharma sites, they are sometimes socially engineered to try and encourage us to make them visible -- and from time to time the posters get confused and post their templates rather than the customised result.

Phishing URLs

We used to collect phishing URLs and website contents, but this collection has not been maintained and is now very dated. Unless you are a historian, we suggest that you should approach the APWG who continue to collect this type of data.

Phishing emails

We have a dataset of phishing emails (sent to a small set of email addresses) from 2005 onwards. The dataset contains numerous duplicates so numbers are inexact, but it contains up to 10000 emails per year until 2010 and around 1000 emails a year since then.

Advanced Fee Fraud (419 scam) emails

We have a dataset of Advanced Fee Fraud (sometimes called 419 scam) emails (sent to a small set of email addresses) from 2001 onwards. The dataset has duplicates so numbers are inexact, but contains around 75000 unique emails. For the past few years offers of loans have been included in this dataset.

Email spam

We have a dataset of "spam" email (sent to a small set of email addresses) dating from 2003 onwards (and indeed some spam email from the mid-1990s onwards as well). Numbers of emails vary dramatically from year to year (and there are large numbers of duplicates) but in recent times exceed 2000 emails a month. Note that for relevant periods phishing and Advanced Fee Fraud emails have been extracted into separate datasets.
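Since the email datasets above all contain duplicates, a first step in most analyses is to estimate how many distinct messages there are. The following is a minimal sketch, assuming message bodies have already been extracted as strings; the normalisation rule (collapse whitespace, ignore case) is an illustrative choice and will miss near-duplicates that differ in, say, a recipient name.

```python
import hashlib
import re


def fingerprint(body: str) -> str:
    """Hash a message body after collapsing whitespace and lower-casing."""
    normalised = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def count_unique(bodies) -> int:
    """Number of distinct bodies under the fingerprint above."""
    return len({fingerprint(b) for b in bodies})


# Invented sample bodies; the first two differ only in spacing and case.
bodies = [
    "Dear friend,\n I have a proposal",
    "dear friend, I have a  proposal",
    "Cheap loans today",
]

print(count_unique(bodies))
```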

Email spam feed

We also have a very substantial dataset of email spam provided by Abusix. Our data starts in July 2020 and runs at around 10 to 20 million emails a day (c 100G in a highly compressed format) so it is worth considering resource requirements, and talking to us, whilst considering your experimental design.

Domain names

We have an archive of registered .com and .net domain names dating back to September 2009. We are now archiving all the "new" TLDs. We may not be able to share the raw data, but if your research needs this type of data we may be able to create derivative datasets for you.

IP address allocation

We maintain a database recording IP address allocations over time (so that it is simple to check on the usage of IP addresses on past dates).
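A historical lookup of this kind can be pictured as matching an address against allocation records that carry validity dates. The following is a minimal sketch, assuming a hypothetical record layout (`network`, `holder`, `start`, `end`) and invented holders; the structure of the actual database is not described here.

```python
from dataclasses import dataclass
from datetime import date
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass
class Allocation:
    # Hypothetical fields, for illustration only.
    network: str          # CIDR prefix
    holder: str           # who the prefix was allocated to
    start: date           # first day the allocation was valid
    end: Optional[date]   # first day it was no longer valid; None if current


def holder_on(allocations, ip, when):
    """Return the holder of the most specific prefix covering ip on a date."""
    addr = ip_address(ip)
    matches = [
        a for a in allocations
        if addr in ip_network(a.network)
        and a.start <= when
        and (a.end is None or when < a.end)
    ]
    if not matches:
        return None
    best = max(matches, key=lambda a: ip_network(a.network).prefixlen)
    return best.holder


# Invented records using a documentation address range.
allocations = [
    Allocation("192.0.2.0/24", "Example ISP A", date(2010, 1, 1), date(2015, 6, 1)),
    Allocation("192.0.2.0/24", "Example ISP B", date(2015, 6, 1), None),
]

print(holder_on(allocations, "192.0.2.5", date(2012, 3, 4)))
```

Treating `end` as exclusive means back-to-back allocations never both match on the handover date.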

...old datasets

We also have the relevant datasets for most of the cybercrime papers we have published over the past 10 years (on phishing, Ponzi schemes, IM worms etc.).