Crawling Twitter – A Long Story about Patience

Turns out, twitter.com doesn’t like it when you request tons of tweets, and it took me quite a while to get a decent number of them. Here is a little update on what I have received so far.

I collected some of those annotated and anonymized tweets over the last few days, using a lovely Python script, twitter_download, written by Alan Ritter (thanks, Alan!). Every 15 minutes I got about 300 more tweets in my ever-growing training data file, and after about 48 hours (I couldn’t collect while traveling) I marked the 20,000th tweet.
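To put the waiting into perspective, here is a rough back-of-the-envelope sketch of the collection time at that rate. The function names are made up for illustration, and it assumes a steady 300 tweets per 15-minute window with no interruptions, which was certainly not the case in practice.

```python
import math


def batches_needed(target_tweets: int, tweets_per_batch: int) -> int:
    """Number of download windows needed to reach the target (rounded up)."""
    return math.ceil(target_tweets / tweets_per_batch)


def collection_hours(target_tweets: int, tweets_per_batch: int,
                     minutes_per_batch: int) -> float:
    """Pure download time in hours, assuming no pauses between windows."""
    return batches_needed(target_tweets, tweets_per_batch) * minutes_per_batch / 60


# Figures from the post: about 300 tweets every 15 minutes, 20,000 tweets in total.
batches = batches_needed(20_000, 300)
hours = collection_hours(20_000, 300, 15)
print(f"{batches} windows, roughly {hours:.1f} hours of uninterrupted collecting")
```

Under these idealized assumptions the pure download time comes out well below the 48 hours of wall-clock time, which fits with the breaks while traveling.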

I used my evening today to evaluate my shiny loot. The first set (with annotated phrases) contains 9451 phrases (some tweets have multiple phrases). Sadly, 1796 of them are unusable because the authors have deleted the tweets since they were first collected.

4780 of the remaining 7655 usable phrases (about 62%) contain positive sentiments. Hooray, social platforms are mostly used for positive stuff! Or not? My guess is that there were more negative tweets originally, which their writers deleted later.
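The numbers above are simple enough to check by hand, but a minimal sketch makes the arithmetic explicit. The variable names are my own for this example; only the three counts come from the data.

```python
# Counts reported for the first set (annotated phrases).
total_phrases = 9451
unusable_phrases = 1796   # tweets deleted by their authors
positive_phrases = 4780

# Usable phrases are whatever is left after dropping the deleted ones.
usable_phrases = total_phrases - unusable_phrases

# Share of positive sentiments among the usable phrases.
positive_share = positive_phrases / usable_phrases

print(f"{usable_phrases} usable phrases, {positive_share:.0%} positive")
```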

So what’s next? Next time I will report on my progress analyzing the training data. So far I have written a half-decent reader; I’ll write about that, too.

Paavo Pohndorff

A Data Science consultant working at Sopra Steria. He occasionally blogs about data and related topics here and is the host of the Dortmund Data Science Meetup.

