Reading and interpreting sentences with an algorithm instead of doing it yourself is tough. Reading tweets is worse, so much worse. Let me tell you about some concepts I came up with.

You might remember my project to evaluate sentiments in short messages like tweets. I’m still working on it. And right now I am, at least in my opinion, at one of the toughest parts. So I’ll try to split it into smaller problems right now, iterating step by step. Remember, the baseline was at about 62%, marking everything positive.

Attempt No. 1: Evaluating unigrams

My first attempt was the easiest I could come up with. It was clear to me that it would be crude and not very powerful. So far I have used DKPro 1.6.2. It’s powerful, but I miss some tools for processing tweets. Because of that, I used the simple BreakIteratorSegmenter in my pipeline to segment my tweets into tokens.

This results in the following:

New cast of DWTS tba at 8pm tonight!! So excited :) The meeting tonight better be done, or "someone" will have to lend me their phone :) #thx
{New, cast, of, DWTS, tba, at, 8pm, tonight, !, !, So, excited, :, ), The, meeting,
tonight, better, be, done, ,, or, ", someone, ", will, have, to, lend, me, their,
phone, :, ), #, thx}

It is segmented, but as you can see, punctuation and special characters like # and @ are handled in a way that doesn’t suit tweets. Emoticons are destroyed, and so are hashtags and account names. Another major problem is that runs of punctuation like !!!!!!!!!!!! are split into single characters. Still, it is good enough for a first attempt.
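To make the problem concrete, here is a minimal sketch of what a tweet-aware segmentation could look like. It is written in Python with a hand-made regular expression, so it is neither the BreakIteratorSegmenter nor any DKPro component. The pattern `TOKEN_PATTERN` and the function `tokenize` are names I made up for illustration, and the emoticon part only covers a handful of common smileys.

```python
import re

# Hypothetical pattern for illustration only. The alternatives are ordered so
# that Twitter-specific tokens are tried before the generic ones.
TOKEN_PATTERN = re.compile(
    r"""
      [@\#]\w+               # account names and hashtags stay in one piece
    | [:;=][-']?[()DPp/|]    # a few common emoticons such as :) ;-) :D
    | [!?.]+                 # runs of sentence punctuation stay together
    | \w+(?:'\w+)?           # words, numbers and simple contractions
    | \S                     # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(tweet):
    """Split a tweet into tokens, keeping emoticons, hashtags and mentions whole."""
    return TOKEN_PATTERN.findall(tweet)


tweet = (
    'New cast of DWTS tba at 8pm tonight!! So excited :) The meeting tonight '
    'better be done, or "someone" will have to lend me their phone :) #thx'
)
tokens = tokenize(tweet)
print(tokens)
```

On the example tweet this keeps `!!`, `:)` and `#thx` as single tokens. It is still crude: the emoticon rule would also fire on the `:/` inside a URL, for instance, so a real tweet tokenizer needs considerably more care.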

After the tweet is segmented, its tokens need some scoring. After some research into the state-of-the-art literature, I found a good amount of help created by the group of Saif Mohammad at the NRC Canada. They won the previous SemEval challenge and published a few papers and some of their data sets. Most interesting for my case, they published a big list of words with associated positive and negative sentiments created from the Sentiment140 corpus. You can find this and other sets right here.

The Sentiment140Lexicon contains tokens with scores that depend on the emoticons in the tweets that contained them. A really awesome list! Here is an example (format: term, score, positive count, negative count):

hiphop 0.677 30 16
#elevensestime 0.677 15 8
@shamim86 0.677 15 8
weston 0.677 15 8
independence 0.677 30 16
@naniwaialeale 0.677 15 8
risks 0.677 15 8
williamsburg 0.677 15 8
goss 0.677 15 8
sab 0.677 15 8

Still, the list is filled with lots of trash. As seen above, it contains Twitter account names (@xyz), hashtags and lots of different punctuation (like repeated exclamation marks of different lengths). Lots of words might have been linked to negative or positive sentiments because of events in the time frame of collection, or they are simply used a lot in both negative and positive statements and are in that case at least controversial (e.g. schools -0.358 153 230). But putting those aside, a great amount of decently tagged gold is left.
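Reading such a list is straightforward. The following is a minimal Python sketch, assuming each line holds four whitespace-separated columns (term, score, positive count, negative count) as in the example above. `LexiconEntry` and `parse_lexicon` are hypothetical names of mine, not part of the lexicon distribution or of DKPro.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    score: float
    positive_count: int
    negative_count: int


def parse_lexicon(lines):
    """Build a dict from term to LexiconEntry, skipping malformed lines."""
    lexicon = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # blank line or unexpected number of columns
        term, score, positive, negative = parts
        lexicon[term] = LexiconEntry(float(score), int(positive), int(negative))
    return lexicon


# Sample lines copied from the examples in this post.
sample = [
    "hiphop 0.677 30 16",
    "#elevensestime 0.677 15 8",
    "@shamim86 0.677 15 8",
    "schools -0.358 153 230",
]
lexicon = parse_lexicon(sample)

# Entries that look like account names are easy to spot afterwards.
account_names = [term for term in lexicon if term.startswith("@")]
print(account_names)
```

Keeping the counts around costs nothing, even though I ignore them in this first attempt. They could later help to filter out terms that were seen too rarely to be trusted.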

For my first attempt:

  • I ignored account names, changing their scores to 0
  • I ignored the counts, both positive and negative
  • I summed up the scores, ignoring the number of positive or negative words
  • I decided to define all tweets with a score of >1 to be positive, <-1 to be negative and the rest to be neutral
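The rules above can be sketched in a few lines of Python. This is an illustration under assumptions, not my actual DKPro pipeline: the function name `classify_tweet` is made up, the scores in `toy_lexicon` are invented for the example and are not taken from the Sentiment140Lexicon, and treating tokens that are missing from the lexicon as 0 is an assumption.

```python
def classify_tweet(tokens, lexicon, threshold=1.0):
    """Sum the lexicon scores of all tokens and map the total to a label."""
    total = 0.0
    for token in tokens:
        if token.startswith("@"):
            continue  # account names are ignored, i.e. they score 0
        # Assumption: tokens that are not in the lexicon contribute 0.
        total += lexicon.get(token, 0.0)
    if total > threshold:
        return "positive"
    if total < -threshold:
        return "negative"
    return "neutral"


# Invented scores, for illustration only.
toy_lexicon = {
    "excited": 1.2,
    ":)": 0.9,
    "hate": -1.5,
    "awful": -0.8,
    "meeting": 0.1,
    "@friend": 3.0,
}

print(classify_tweet(["so", "excited", ":)"], toy_lexicon))
print(classify_tweet(["i", "hate", "this", "awful", "meeting"], toy_lexicon))
print(classify_tweet(["@friend", "meeting"], toy_lexicon))
```

The `threshold` parameter is the +1/-1 range from the last bullet point, so it is easy to try other values.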

The result wasn’t bad. Analyzing the training data and scoring the tweets resulted in an accuracy of 46.8 %. Compared to the baseline of 37.3 %, this is at least an improvement. I did some tweaking on the ranges for the final score, but the initial +1/-1 was pretty good already.
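For reference, this is how such accuracy figures can be computed. The sketch below is plain Python with invented labels, so its numbers have nothing to do with the results above. `accuracy` and `constant_baseline` are hypothetical helper names, and the baseline is assumed to label every tweet with the same class.

```python
def accuracy(predicted, gold):
    """Fraction of tweets whose predicted label matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty data set")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def constant_baseline(gold, label="positive"):
    """Accuracy reached by assigning the same label to every tweet."""
    return accuracy([label] * len(gold), gold)


# Invented labels, for illustration only.
gold = ["positive", "negative", "neutral", "positive", "neutral"]
predicted = ["positive", "negative", "neutral", "neutral", "neutral"]

print(f"system:   {accuracy(predicted, gold):.1%}")
print(f"baseline: {constant_baseline(gold):.1%}")
```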

Planned Improvements

I am already further along with development but wanted to share my effort. I moved my DKPro framework to 1.7.0 so I can use the arktools-gpl. Using the ArktweetTokenizer to segment the tweets improved the final score to 50.4 % with the same methods as above. I will cover the arktools dependency in more detail in a future post.

If you have any questions or suggestions, leave a message.