As mentioned in last week’s blog post, I want to get into the different features that I see as very relevant for detecting sentiment in tweets. Some of them are easy to detect, some aren’t. Let’s get an overview of negation first…

In the past months I have read lots of papers discussing ways to detect sentiment in microblog-style texts (tweets, user reviews, SMS, etc.). Most of them suggest using specific features depending on the use case (e.g. Wilson et al. (2006) on Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis). Of course, the authors have much more expertise than me, and I won’t be able to integrate all the suggested methods. I set high goals for myself and missed some of them, or didn’t even attempt them, since time was short. But before I get too far into reviewing my own work, let me start explaining the ideas that I have implemented or almost implemented.

Negators and Negated Tokens:

Negations are one of the most common sources of errors in sentiment detection when the text is scored naively, as I did with the unigram scoring. If a sentence contains a Negator, the words it negates should have their scores inverted. For example, in the phrase ‘I don’t like school.’, the Negator ‘don’t’ negates the tokens ‘like’ and ‘school’.

Detecting Negators was one of my first ideas after reading some literature. Since I couldn’t find useful lists of negation words, I collected them on my own using a list of words ending in ‘n’t’ from wiktionary.org, common negative words from grammerly.org, and some great insights into the usage of negations from linguisticsgirl.com. I thought about adding an algorithmic approach to detect morphological variations, but so far I haven’t had the motivation. To add some variants anyway, I used my own experience with the flaws of grammar on Twitter and the variations mentioned at wiktionary.org.

My result can be found here. Besides simply listing the Negators, I also added reference words (the morphological root, like ‘cant’ -> ‘can’t’), the most common POS tag and, if available, the sentiment score from the Sem140Unigram lexicon:

negator   reference word              POS tag   <sem140 score>
isn't     isn't            negative   V         -0.82
isnt      isn't            negative   V         -1.16
ain't     isn't            negative   V         -0.128
aint      isn't            negative   V         0
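To make the lookup concrete, here is a minimal sketch of how a lexicon in this format could be loaded and used to tag Negators. It is not the code behind this post: the names `NegatorEntry`, `parse_lexicon` and `tag_negators` are hypothetical, and I am assuming whitespace-separated columns in the order shown above.

```python
from typing import NamedTuple, Optional


class NegatorEntry(NamedTuple):
    reference: str          # reference word, e.g. "isn't" for "isnt"
    category: str           # third column of the lexicon ("negative" above)
    pos: str                # most common POS tag
    score: Optional[float]  # Sem140 unigram score, None if not available


def normalize(token):
    """Lowercase and map the typographic apostrophe to a plain one."""
    return token.lower().replace("\u2019", "'")


def parse_lexicon(lines):
    """Parse whitespace-separated lexicon lines into a dict keyed by Negator."""
    lexicon = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # skip header or malformed lines
        surface, reference, category, pos = parts[:4]
        score = float(parts[4]) if len(parts) > 4 else None
        lexicon[normalize(surface)] = NegatorEntry(reference, category, pos, score)
    return lexicon


def tag_negators(tokens, lexicon):
    """Return (token, is_negator) pairs for a tokenized tweet."""
    return [(tok, normalize(tok) in lexicon) for tok in tokens]


SAMPLE = [
    "<sem140 score>",
    "isn't isn't negative V -0.82",
    "isnt isn't negative V -1.16",
    "ain't isn't negative V -0.128",
    "aint isn't negative V 0",
]
lexicon = parse_lexicon(SAMPLE)
tagged = tag_negators("this isnt funny".split(), lexicon)
print(tagged)
```

Keying the dictionary by the variant (‘isnt’, ‘aint’) rather than by the reference word keeps the lookup to a single step per token.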

With such a lexicon it is possible to tag Negators in tweets, and the first step is done. I continued with the search for negated words, which took me a few days. I started by extracting the most common bigrams containing Negators and, after looking through them, quickly realized that they wouldn’t help very much. At least they led me to the conclusion that I wouldn’t find many negated words in front of a Negator. Most negated words that might carry sentiment information (most commonly verbs, adjectives or nouns) appear after the Negator that refers to them.
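The bigram extraction could look roughly like the following sketch. The function name `negator_bigrams` and the toy tweets are made up for illustration; real input would be the tokenized tweets and the Negator lexicon.

```python
from collections import Counter


def negator_bigrams(tweets, negators):
    """Count token bigrams in which at least one token is a Negator."""
    counts = Counter()
    for tokens in tweets:
        lowered = [t.lower() for t in tokens]
        for left, right in zip(lowered, lowered[1:]):
            if left in negators or right in negators:
                counts[(left, right)] += 1
    return counts


# Toy data, not taken from the real corpus.
tweets = [
    "i don't like school".split(),
    "i don't like mondays".split(),
    "school is not fun".split(),
]
negators = {"don't", "not"}
bigram_counts = negator_bigrams(tweets, negators)
for bigram, count in bigram_counts.most_common():
    print(bigram, count)
```

Even in this toy example the tokens in front of the Negator (‘i’, ‘is’) carry no sentiment, while the ones behind it (‘like’, ‘fun’) do.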

I could only find Negators with the POS tags R (adverbs, e.g. ‘not’, ‘never’, ‘barely’), V (verbs, e.g. ‘wasn’t’, ‘shouldnt’) and D (determiners: ‘no’). Note that ‘no’ is the only determiner I identified, and that the most common negating adverb is ‘not’, since it is often used as a separate token after a verb (e.g. ‘do’ + ‘not’).

I identified the most common bigrams, trigrams and 4-grams of POS tags containing Negators and negated tokens. These are as follows (Q is used as a wildcard that can stand for any POS tag; negated tokens are bold):

  • R’VDQ
  • R’VA
  • R’VD
  • R’VPQ
  • R’V,
  • R’VV
  • R’DN
  • R’DA
  • R’RA
  • R’AP
  • D’NN
  • D’N,
  • D’NVQ
  • D’NPQ
  • V’RV
  • V’V,
  • V’VO
  • V’VPQ
  • V’V^
  • V’VN
  • V’VR
  • V’VA
  • V’VV
  • V’VDQ

Short explanation of the letters: R – adverb; V – verb; A – adjective; D – determiner; P – pre-/postposition; N – common noun; ^ – proper noun; O – pronoun
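Patterns like these can be collected by counting the POS-tag sequences that start at a Negator. The following is only a sketch under the assumption that tweets are available as lists of (token, POS tag) pairs; `pos_patterns_after_negator` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def pos_patterns_after_negator(tagged_tweets, negators, n=3):
    """Count POS-tag n-grams whose first token is a Negator."""
    counts = Counter()
    for tagged in tagged_tweets:
        for i, (token, _) in enumerate(tagged):
            if token.lower() not in negators:
                continue
            window = tagged[i:i + n]
            if len(window) == n:  # ignore Negators too close to the end
                counts["".join(pos for _, pos in window)] += 1
    return counts


# Toy data with hand-assigned tags.
tagged_tweets = [
    [("i", "O"), ("don't", "V"), ("like", "V"), ("school", "N")],
    [("not", "R"), ("a", "D"), ("fan", "N")],
    [("we", "O"), ("don't", "V"), ("need", "V"), ("homework", "N")],
]
patterns = pos_patterns_after_negator(tagged_tweets, {"don't", "not"})
print(patterns.most_common())
```

Running it with n=2, n=3 and n=4 gives the bigram, trigram and 4-gram patterns separately.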

These aren’t all the n-grams that I found, but they cover most cases. I developed an algorithm that searches for negated tokens step by step: depending on the first token following a Negator (which might be negated), it evaluates further tokens if necessary. As a result, Negators and negated tokens are found and tagged properly.
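A simplified version of such a step-by-step search might look like the sketch below. The concrete rules are my assumptions for illustration, not the exact rules of my implementation: verbs, adjectives and common nouns after a Negator get negated, determiners, adverbs and prepositions are skipped, and anything else (punctuation, pronouns, proper nouns) ends the scope. The lookahead of three tokens mirrors the longest pattern above, a 4-gram.

```python
SENTIMENT_TAGS = {"V", "A", "N"}  # tags assumed to carry sentiment
SKIP_TAGS = {"D", "R", "P"}       # tags assumed to sit in between
MAX_LOOKAHEAD = 3                 # Negator plus 3 tokens = 4-gram


def find_negated(tagged, negators):
    """Label each (token, pos) pair as 'NEGATOR', 'NEGATED' or None."""
    labels = [None] * len(tagged)
    for i, (token, _) in enumerate(tagged):
        if token.lower() not in negators:
            continue
        labels[i] = "NEGATOR"
        end = min(i + 1 + MAX_LOOKAHEAD, len(tagged))
        for j in range(i + 1, end):
            next_token, pos = tagged[j]
            if next_token.lower() in negators:
                break  # the next Negator opens its own scope
            if pos in SENTIMENT_TAGS:
                labels[j] = "NEGATED"
            elif pos not in SKIP_TAGS:
                break  # punctuation, pronoun, proper noun: scope ends
    return labels


tagged = [("I", "O"), ("don't", "V"), ("like", "V"), ("school", "N"), (".", ",")]
labels = find_negated(tagged, {"don't"})
print(list(zip((tok for tok, _ in tagged), labels)))
```

For the example from the beginning of this post, ‘don’t’ is labeled as Negator and ‘like’ and ‘school’ as negated, while the final full stop closes the scope.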

In a later step of the pipeline these tokens are processed differently from untagged tokens. For now I set the scores of Negators to 0 and multiply the scores of negated tokens by -1.5 to compensate for the dropped Negator score.
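As a sketch, this scoring rule comes down to a few lines. The label strings and the scores in the example are invented for illustration; only the factor -1.5 and the Negator score of 0 come from the description above.

```python
NEGATION_FACTOR = -1.5  # multiplier for negated tokens


def score_tokens(tokens, labels, unigram_scores):
    """Sum unigram scores, dropping Negators and flipping negated tokens."""
    total = 0.0
    for token, label in zip(tokens, labels):
        if label == "NEGATOR":
            continue  # Negator score is set to 0
        score = unigram_scores.get(token.lower(), 0.0)
        if label == "NEGATED":
            score *= NEGATION_FACTOR
        total += score
    return total


tokens = ["i", "don't", "like", "school"]
labels = [None, "NEGATOR", "NEGATED", "NEGATED"]
# Made-up scores, not actual Sem140 values.
unigram_scores = {"don't": -0.5, "like": 1.0, "school": -0.2}

plain = sum(unigram_scores.get(t, 0.0) for t in tokens)
negated = score_tokens(tokens, labels, unigram_scores)
print(plain, negated)
```

With these made-up values the plain unigram sum is slightly positive (about 0.3), while the negation-aware score is clearly negative (about -1.2), which is the behavior I am after for a phrase like ‘I don’t like school’.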

Here are the changes in regard to the unigram approach (again 1500 random Tweets, 5 runs, values in %):

                      #1  #2  #3  #4  #5   Ø
Positive Correct      66  68  66  69  68  67
Positive As Negative   7   8   6   7   6   7
Positive As Neutral   27  24  28  24  26  26
Negative Correct      49  46  50  51  48  49
Negative As Positive  21  20  20  21  22  21
Negative As Neutral   30  33  30  28  29  30
Neutral Correct       41  43  43  35  39  40
Neutral As Positive   44  43  44  49  45  45
Neutral As Negative   15  15  14  17  16  15
Ø(correct)            52  52  53  51  52  52
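For reference, numbers like these can be computed from the gold and predicted labels of a run as sketched below. The function names are hypothetical, and treating Ø(correct) as the unweighted mean of the three ‘Correct’ rows is an assumption of this sketch.

```python
from collections import Counter

LABELS = ("positive", "negative", "neutral")


def confusion_percentages(gold, predicted):
    """Percentage of each gold class that was predicted as each class."""
    pair_counts = Counter(zip(gold, predicted))
    gold_counts = Counter(gold)
    result = {}
    for g in LABELS:
        for p in LABELS:
            if gold_counts[g]:
                result[(g, p)] = 100.0 * pair_counts[(g, p)] / gold_counts[g]
            else:
                result[(g, p)] = 0.0  # class did not occur in this run
    return result


def mean_correct(percentages):
    """Unweighted mean of the per-class 'correct' percentages."""
    return sum(percentages[(label, label)] for label in LABELS) / len(LABELS)


# Tiny made-up run with four tweets.
gold = ["positive", "positive", "negative", "neutral"]
predicted = ["positive", "neutral", "negative", "positive"]
percentages = confusion_percentages(gold, predicted)
print(percentages[("positive", "positive")], mean_correct(percentages))
```

Averaging the resulting dictionaries over the five runs gives the Ø column.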

I marked the significant changes (5 % or more) green (if improved) and red (if worsened). As you can see, evaluating negations improved the detection of positive sentiment by a good amount. Still, the improvement is small on the overall scale, since the correct detection of negative and neutral sentiment didn’t change. It should also be noted that more negative tweets were detected as positive than before.

The Verdict

I’m happy that the overall score changed at least a bit in the right direction. The increased correct detection of positive sentiment is great, and I hope that further development will improve the detection of neutral and negative sentiment, too.

I’m beginning to wonder whether the Sem140Unigram scores are the best choice for my approach, or whether I’m using them correctly. I might try to derive the polarity of words from how often they appear in positive or negative tweets, so that I only use high-polarity words for scoring. Either that or I’ll look for another lexicon.
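One way this idea could be tried out is sketched below. Nothing of this is implemented yet: the polarity formula (positive minus negative appearances, divided by their sum), the threshold and the minimum count are all assumptions for illustration.

```python
from collections import Counter


def word_polarity(labeled_tweets, min_count=2):
    """Polarity in [-1, 1] from appearances in positive vs. negative tweets."""
    pos, neg = Counter(), Counter()
    for tokens, label in labeled_tweets:
        words = {t.lower() for t in tokens}  # count each word once per tweet
        if label == "positive":
            pos.update(words)
        elif label == "negative":
            neg.update(words)
    polarity = {}
    for word in set(pos) | set(neg):
        total = pos[word] + neg[word]
        if total >= min_count:  # ignore words seen too rarely
            polarity[word] = (pos[word] - neg[word]) / total
    return polarity


def high_polarity(polarity, threshold=0.8):
    """Keep only words whose polarity is strongly positive or negative."""
    return {w: p for w, p in polarity.items() if abs(p) >= threshold}


# Made-up training tweets.
labeled_tweets = [
    (["love", "this", "phone"], "positive"),
    (["love", "it"], "positive"),
    (["hate", "this", "phone"], "negative"),
    (["hate", "mondays"], "negative"),
    (["this", "is", "ok"], "neutral"),
]
strong = high_polarity(word_polarity(labeled_tweets))
print(strong)
```

Words like ‘this’ and ‘phone’ appear equally often in both classes and drop out, which is exactly the kind of noise I would like to keep away from the scoring.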

If you have any questions or suggestions, write a comment below. I’ll be happy to read and answer those.

A Data Science consultant working at Materna. He occasionally blogs about data and related topics here and is the host of the Dortmund Data Science Meetup.