Defining Data Science Using the Common Crawl Web Corpus – 1

Recently I finally finished one of my major projects (my thesis). So now I have some spare time for doing smaller projects after work. Like this one:

I discovered lots of interesting huge datasets while looking for vast unstructured corpora, especially text. I found the Registry of Open Data on AWS containing data about air quality, stock markets, satellite images, genomes, Amazon reviews and much more. My favourite though is the Common Crawl Corpus.

The Common Crawl (CC) is a corpus containing over 5 billion (5.000.000.000!) websites in raw format (WARC), meta (WAT) data and text (WET). This takes up about a petabyte of storage or maybe some more terabytes. Doesn’t matter at this volume size. This amount of websites in multiple languages has been acquired in the last 7 years and is distributed with a CC. In April alone they crawled about 80 TB.

Since I’m always looking for new opportunities to improve my methodical and technical knowledge I want to know where to invest my time. I found that job postings are a pretty good indicator for current requirements for professionals and what skills are scarce in companies. On the other hand these postings give insight in how the role of Data Scientists (and similar named) is seen.

Common Crawl has an indexing API access to have a look which websites are in the sets that are typically stored in monthly batches. You can find out how to use it here. If you are using Python 3 you will run into errors, a solution has already been offered but has yet to be merged. You can find the pull request by Martin Kokos here. As soon as everything is set up, write something like this to figure out how many pages (fragmented lists of urls) are available.

./cdx-index-client.py -c CC-MAIN-2018-17 https://indeed.com/* --show-num-pages

./cdx-index-client.py -c CC-MAIN-2018-17 https://indeed.com* --f url

Indeed.com is one of the biggest recruiting platforms (38 pages of lists) in the US but I couldn’t find a comfortable way to point at specific offers. So the next best source is Moster.com:

./cdx-index-client.py -c CC-MAIN-2018-17 https://monster.com* --f url

The result is much smaller but still no offers. But Monster.com has a subdomain containing these postings so you can find them with:

./cdx-index-client.py -c CC-MAIN-2018-17 https://job-openings.monster.com/* --f url

I learned about the subdomain a while ago when I was still scraping these postings myself (Which is hopefully not required anymore). The resulting set contains 547 items including 105 robot.txt files. The other urls point at job postings how I remember them. So let’s dig deeper in the next part of this ongoing series.

EDIT

A little mistake has been made. The proper subdomain is jobview.monster.com that already shows 3 pages of results for April ’18. job-openings is a different smaller one and I should evaluate if they are already part of jobview. More on the next post.

Paavo Pohndorff

A Data Science consultant working at Sopra Steria. He occasionally blogs about data and related topics here and is the host of the Dortmund Data Science Meetup.

Blogging data since 1886

Defining Data Science Using the Common Crawl Web Corpus – 1

Leave a Reply Cancel reply

Defining Data Science Using the Common Crawl Web Corpus – 1

SHORT – Dealing with embedded nul in string manipulation with R

Bring Data Science Activities to the Cloud

Leave a Reply Cancel reply