Over the last few weeks I have refined and partially overhauled my algorithms for analyzing sentiment in tweets, with some notable results. Here is what I have come up with so far. I am starting to feel like I’m doing science instead of the tedious tasks I did in my previous semesters.

My software project is taking shape and that makes me really happy. I’ve put a lot of time into it lately and came up with some decent to good ideas for improving almost every part of my program. I previously wrote about my baseline being right 62% of the time. Well, that was bollocks, and I solved the issue with a new system.

A brand new Reader for a proper Baseline

So here is what I think is much more fitting. If I want to create an algorithm that detects one of three sentiments in a tweet (positive, negative or neutral), the baseline should be 1/3. Previously I got a baseline of 62% by simply marking every tweet as positive, which just means that 62% of all tweets were positive and the other 38% were negative or neutral. Measured against that, I could have shown a lot of “improvement” simply by switching the baseline to mark everything as negative or neutral, dropping it to around 20% (I’m estimating). That’s not very scientific, quite the opposite actually. I also used other baselines, and I don’t even know anymore how I arrived at those…

So now I have added a Reader that collects an equal number of tweets for each sentiment. These are picked at random, so the processed tweets aren’t the same each time. This is quite useful, since I won’t just tune my code on the same tweets but can verify it on different sets. I also added a parameter to the Reader that defines the total number of tweets. Normally I test my program on 900 to 2100 tweets in total to get an overview of the accuracy I can currently achieve.
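To make the idea concrete, here is a minimal sketch of how such a balanced, randomized Reader could work. It assumes the tweets are already grouped by their gold sentiment label; the BalancedTweetReader name and the Tweet type are placeholders for illustration, not my actual implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: draw the same number of tweets per sentiment class
// and shuffle, so that each evaluation run sees a different random selection.
public class BalancedTweetReader {

    public static List<Tweet> read(Map<String, List<Tweet>> tweetsBySentiment, int totalTweets) {
        int perClass = totalTweets / tweetsBySentiment.size(); // e.g. 1500 / 3 = 500 per sentiment
        List<Tweet> selection = new ArrayList<>();
        for (List<Tweet> classTweets : tweetsBySentiment.values()) {
            if (classTweets.size() < perClass) {
                // guard against asking for more tweets than the smallest class can provide
                throw new IllegalArgumentException("Not enough tweets in one sentiment class for " + totalTweets);
            }
            List<Tweet> shuffled = new ArrayList<>(classTweets);
            Collections.shuffle(shuffled);                     // random pick on every run
            selection.addAll(shuffled.subList(0, perClass));
        }
        Collections.shuffle(selection);                        // mix the classes before processing
        return selection;
    }
}
```

With a parameter of 1500 this yields 500 tweets per sentiment, which also explains the limit I mention below.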

There are still some bugs that I will fix in the near future, but mostly it works fine. You just shouldn’t pass a parameter that is more than three times the size of the smallest sentiment class (i.e. if there are only 500 tweets with negative sentiment, you should not use a value higher than 1500). In the future, issues like these should be caught by the Reader itself.

With this setup I can finally work against a constant baseline of 33%.

For a better Understanding: ArkTools

So in my first attempt I started using the unigram scores offered by the group around Saif Mohammad. This already went reasonably well, but still wasn’t better than tossing a coin.

I have integrated arktools-gpl into my project for much better segmentation and especially for its very good Part-of-Speech (POS) tagging. Arktools is based on this paper by Kevin Gimpel et al., which is a fantastic read for everybody interested in the topic. The dkpro dependency contains two relevant classes so far:

  1. ArkTweetTokenizer
  2. ArkTweetPosTagger
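Wiring these two into a pipeline is straightforward with uimaFIT. The following sketch shows roughly how I’d expect it to look; the package names follow the DKPro Core 1.x layout and the exact imports and default models may differ depending on the version, so take it as an assumption rather than my exact setup.

```java
import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;
import de.tudarmstadt.ukp.dkpro.core.arktools.ArkTweetPosTagger;
import de.tudarmstadt.ukp.dkpro.core.arktools.ArkTweetTokenizer;

public class ArkPipelineDemo {

    public static void main(String[] args) throws Exception {
        // wrap a single tweet in a JCas document
        JCas jcas = JCasFactory.createText("This is great!!! :-) Can't wait for the weekend", "en");

        // tokenize and POS-tag with the Twitter-aware ark components
        SimplePipeline.runPipeline(jcas,
                createEngineDescription(ArkTweetTokenizer.class),
                createEngineDescription(ArkTweetPosTagger.class));

        // print every token together with its (reduced) POS tag
        for (Token token : JCasUtil.select(jcas, Token.class)) {
            System.out.println(token.getCoveredText() + "\t" + token.getPos().getPosValue());
        }
    }
}
```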

The Tokenizer does what the BreakIterator did before, with the major difference that it works incredibly well on twitter-specific sentences. No more hassle with multiple punctuation marks like ‘?!?’ or ‘:-)’ and other special characters. The same goes for negated verbs and other tokens.

The POS-Tagger is, in my opinion, the solution I had been searching for for a while. Take a standard POS-Tagger, like the one offered in the StanfordCore, which is a really mighty tool. But those ‘classic’ tools have one major drawback that arktools solves: tweets are full of metaphors, weird n-grams made of punctuation (like emoticons), and far too often the grammar is totally broken. If you are a linguist, I guess I can feel your pain.

Arktools uses a reduced tag set that is sadly not compatible with common parsers (like the StanfordParser from the StanfordCore mentioned above), so I have to get creative from here on. Maybe some day we’ll have a good parser for tweets in the dkpro framework. For now I can only wonder what impact tools like the TweeboParser would have on my project. But it is currently only available in Python, and translating or integrating it into my Java project feels like it would take too much time.

Using the simple UnigramScorer, I already received some decent results after including the Tokenizer. Using 1500 tweets at random (5 evaluation runs) got me the following results (all values in %):

                       #1   #2   #3   #4   #5    Ø
Positive Correct       58   61   58   64   58   60
Positive As Negative    7    8    9    7    8    8
Positive As Neutral    35   31   33   29   34   32
Negative Correct       48   49   48   51   48   49
Negative As Positive   17   17   17   15   17   17
Negative As Neutral    35   34   36   34   35   35
Neutral Correct        41   42   43   46   43   43
Neutral As Positive    44   45   41   42   42   43
Neutral As Negative    15   13   16   11   15   14
Ø (correct)            49   51   49   54   50   51

As you can see, just using the unigram scores and summing them up results in about 50% accuracy. As before, I classify a tweet as positive when the score is above 1 and as negative when it is below -1; everything in between counts as neutral.
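For clarity, this is essentially the decision rule written out as a small sketch. The lexicon map stands in for the unigram lexicon by Mohammad et al.; my actual UnigramScorer sits inside the dkpro pipeline, so this is a simplification rather than the real class.

```java
import java.util.List;
import java.util.Map;

// Simplified sketch of the scoring rule: sum the lexicon score of every token
// and map the total onto one of the three sentiment classes.
public class SimpleUnigramScorer {

    private final Map<String, Double> unigramLexicon; // token -> sentiment score

    public SimpleUnigramScorer(Map<String, Double> unigramLexicon) {
        this.unigramLexicon = unigramLexicon;
    }

    public String classify(List<String> tokens) {
        double total = 0.0;
        for (String token : tokens) {
            total += unigramLexicon.getOrDefault(token.toLowerCase(), 0.0);
        }
        if (total > 1.0) {
            return "positive";
        }
        if (total < -1.0) {
            return "negative";
        }
        return "neutral";
    }
}
```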

Positive sentiments (60%) have a 10-20 percentage point higher chance of being detected correctly compared to negative (49%) and neutral (43%) ones. But the interesting part is which class the detector picks when it is wrong. The most common errors are positive or negative sentiments that did not score high (32%) or low (35%) enough, and neutral sentiments that scored too high (43%) or too low (14%). I tried different parameters for the positive and negative thresholds (e.g. setting the minimum score for a positive sentiment to 1.25 or higher), and even after lots of tinkering the overall results didn’t get any better. So my initial thresholds were absolutely fine in my opinion.

Adding some Meat: The Features

To improve on this, I suggest some specific features that should be included in the algorithm. I’ll introduce them here and get into each one in a later blog post; a rough sketch of how a few of them could be applied follows the list. Some of them are already integrated into my project, some are in development and some are still waiting to be included:

  1. Negating Tokens: Negations are a big factor in changing the sentiment of a phrase. I have identified most negations used in short messages like tweets for further use. Their own score should be set to 0, and the tokens they negate have to flip their sentiment.
  2. Negated Tokens: Detecting negations is only the first step; the next is to find the negated token, which is most often positioned after the negating token. I have searched for the most common 3-grams containing negations and evaluated their POS tags, and with those I came up with a system to find negated tokens. Their score is flipped and multiplied by 1.5 right now.
  3. ALL CAPS: If an author writes words in ALL CAPS, they were probably very important to the message. Most of these words are sentiment-heavy. I multiply their score by 3 right now.
  4. #Hashtag: Hashtags are most often very important to the tweet and are scored (if not yet known in the unigram lexicon by Mohammad et al.) with 1.5 times the score of the word without the ‘#’.
  5. Intensifier and Intensified Tokens: Some tokens are used as intensifiers (e.g. ‘super awesome’, ‘totally drunk’). Their intensity can be defined and used to increase or lower the score of the intensified token. Punctuation is also used as an intensifier (e.g. ‘This is great!!!!!’).
  6. Emoticons: All over the Internet emoticons are used to express happiness, sadness, anger and so on. I have already collected lots of different emoticons together with their sentiment and expression. These are very useful for detecting the sentiment of tweets; often, simply detecting the emoticons would suffice.
  7. Elongations: Some authors use elongations to further increase the importance of a specific token (e.g. ‘I loooooooove ice cream’, ‘woooohoo, party!’).
  8. Positioning of Tokens in the Tweet: This is a total ‘maybe’ for this project. I might consider adding different scores depending on sentence position. I might also consider evaluating sentences on their own; right now I simply evaluate the whole tweet.
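To give an impression of how a few of these adjustments could combine, here is the rough sketch mentioned above. It applies the negation flip, the ALL CAPS boost and the hashtag factor to a single token score; the method and the isNegated flag are made up for illustration, and the multipliers are simply the current values from the list.

```java
// Rough, illustrative sketch of per-token feature adjustments before the
// tweet-level sum; not the final implementation.
public class FeatureAdjuster {

    public double adjust(String token, double baseScore, boolean isNegated) {
        double score = baseScore;

        if (isNegated) {
            score = -score * 1.5;   // feature 2: flip the sentiment of negated tokens and boost it
        }
        if (token.length() > 1 && token.equals(token.toUpperCase())
                && token.chars().anyMatch(Character::isLetter)) {
            score *= 3.0;           // feature 3: ALL CAPS words carry extra weight
        }
        if (token.startsWith("#")) {
            score *= 1.5;           // feature 4: hashtags score 1.5 times the plain word
        }
        return score;
    }
}
```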

Planned Improvements

Some of the above features are already added, some aren’t. I’ll give another update soon-ish. If you have any questions or suggestions, don’t hesitate to leave a comment below.