In this post we are going to explore sentiment analysis using Python. Let us first define the problem. Given a corpus of documents, we want to train a classifier to classify documents into one of several classes:
- positive: e.g., "that movie was awesome"
- negative: e.g., "that movie sucked"
We may also want a third category:
- neutral: e.g., "I saw the movie at the Odeon".
In reality, we may have a fourth category, unknown, where we cannot tell. For instance, a document may be in a foreign language, consist only of stop words, or contain only words that our classifier does not understand, such as SMS acronyms (e.g., "cu l8r"). For simplicity, we can either leave these documents unclassified or categorize them as neutral.
To train the classifier we will need to choose a set of features, the facets of the data that the classifier will work on. The simplest scenario is to use each word in the document as a feature. (Later, we will discuss more complex features such as bigrams.)
This is a supervised learning problem, so we will need a labeled corpus of data: a set of documents, each of which has been labeled with one of our class names or IDs. One of the simplest formats is a CSV or TSV file:
pos that movie was awesome
neu I saw the movie at the Odeon
pos I love this place
pos I am happy
....
We will need a learner, an algorithm that will learn to classify. In most examples that I've seen online, people have used a naive Bayes classifier but there are many others that one could choose.
Importantly, sentiment analysis is very context sensitive. If you train a classifier using movie review data, it likely will not fare well classifying documents about election results or your startup's product. A classifier trained on US English tweets may or may not classify UK tweets well.
Finally, we cannot expect to assign sentiment correctly 100% of the time. Even humans often disagree about the sentiment of documents. There are several reasons:
- Cultural differences mentioned above. For example, "rocking" and "sick" are positive adjectives to members of some demographics and cultures but not others.
- English is a hard and ambiguous language. For instance, consider Chomsky's example: "old men and women". Is that [old men] and [old women], or does it mean [old men] and [women]? Another example is Eats, Shoots & Leaves.
- Sarcasm, innuendo, and double entendres. This is one of the current challenges in NLP. For a fun example, take a look at the problem of detecting "that's what she said" jokes.
- The bag-of-words model. People often use a bag-of-words model for sentiment analysis, at least as a first pass. That is, we analyze a document as a set of words rather than as phrases. Thus, we will miss that the "not" in "not good" negates "good" (a quick sketch follows the quote below). In general, we will miss double negatives and other qualifiers. I love this illustrative example:
During a lecture the Oxford linguistic philosopher J.L. Austin made the claim that although a double negative in English implies a positive meaning, there is no language in which a double positive implies a negative. To which Morgenbesser responded in a dismissive tone, "Yeah, yeah."
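To make the bag-of-words limitation concrete, here is a minimal sketch (the sentences and the naive whitespace tokenization are purely for illustration) showing that "good" and "not good" produce almost the same bag, with "not" floating free as an unconnected token:

bag_pos = set("this movie is good".lower().split())
bag_neg = set("this movie is not good".lower().split())
print bag_pos & bag_neg   # both bags contain 'good'
print bag_neg - bag_pos   # the only difference is the lone token 'not'

A bag-of-words classifier sees "good" as a positive signal in both sentences; nothing in the representation ties "not" to the word it negates.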
Now that we have the background out of the way, let's start building a concrete example.
First, let's grab a corpus. I'm taking the UMICH SI650 - Sentiment Classification training set from Kaggle.
This is a tab-delimited file with 7086 sentences tagged as 1 or 0.
head training.txt
1 The Da Vinci Code book is just awesome.
1 this was the first clive cussler i've ever read, but even books like Relic, and Da Vinci code were more plausible than this.
1 i liked the Da Vinci Code a lot.
1 i liked the Da Vinci Code a lot.
1 I liked the Da Vinci Code but it ultimatly didn't seem to hold it's own.
1 that's not even an exaggeration ) and at midnight we went to Wal-Mart to buy the Da Vinci Code, which is amazing of course.
1 I loved the Da Vinci Code, but now I want something better and different!..
1 i thought da vinci code was great, same with kite runner.
1 The Da Vinci Code is actually a good movie...
1 I thought the Da Vinci Code was a pretty good book.
There are many duplicates. Let's remove them:
cat training.txt | sort | uniq > uniq_training.txt
How many positive and negative samples remain?
cat uniq_training.txt | grep ^1 | wc -l
772
cat uniq_training.txt | grep ^0 | wc -l
639
We need to extract features from a document. We'll take unique, lowercase words with more than two characters:
def extract_features(document):
    # one boolean feature per unique word of more than two characters
    features = {}
    for word in set(document.split()):
        if len(word) > 2:
            features['contains(%s)' % word.lower()] = True
    return features
>>> extract_features('" A couple of very liberal people I know thought Brokeback Mountain was " stupid exploitation.')
{'contains(very)': True, 'contains(people)': True, 'contains(couple)': True, 'contains(mountain)': True, 'contains(was)': True, 'contains(brokeback)': True, 'contains(liberal)': True, 'contains(exploitation.)': True, 'contains(know)': True, 'contains(thought)': True, 'contains(stupid)': True}
We need to read in our documents:
documents = []
f = open("uniq_training.txt", "r")
for document in f.readlines():
    parts = document.strip().split("\t")               # [label, text]
    documents.append((parts[1], bool(int(parts[0]))))  # (text, True/False)
f.close()
>>> documents[0]
('" A couple of very liberal people I know thought Brokeback Mountain was " stupid exploitation.', True)
Now extract features from each document in our corpus and split into a training set (80%) and a test set (20%):
import random
random.seed(1234) #so that you can reproduce my results if you wish
random.shuffle(documents)
import nltk
n_train = int(0.8*len(documents))
training_set = nltk.classify.apply_features(extract_features,documents[:n_train])
test_set = nltk.classify.apply_features(extract_features,documents[n_train:])
>>> training_set[0]
({'contains(very)': True, 'contains(people)': True, 'contains(couple)': True, 'contains(mountain)': True, 'contains(was)': True, 'contains(brokeback)': True, 'contains(liberal)': True, 'contains(exploitation.)': True, 'contains(know)': True, 'contains(thought)': True, 'contains(stupid)': True}, True)
Finally, let's train our classifier:
>>> classifier = nltk.NaiveBayesClassifier.train(training_set)
>>> nltk.classify.accuracy(classifier, test_set)
0.911660777385159
Whoah, 91% accuracy isn't bad at all given that we had a fairly balanced training set:
ct_pos = 0
for d in training_set:
    if d[1] == True: ct_pos += 1
print ct_pos, len(training_set)-ct_pos
617 511
By that I mean I would have been suspicious if we had a very imbalanced training set where, say, 95% of samples were positive.
The proper way to assess the performance is to examine precision, recall and F-scores:
import collections
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)
>>> print 'pos precision: %2.3f' % nltk.metrics.precision(refsets[True], testsets[True])
pos precision: 0.951
>>> print 'pos recall: %2.3f' % nltk.metrics.recall(refsets[True], testsets[True])
pos recall: 0.884
>>> print 'pos F-measure: %2.3f' % nltk.metrics.f_measure(refsets[True], testsets[True])
pos F-measure: 0.916
>>> print 'neg precision: %2.3f' % nltk.metrics.precision(refsets[False], testsets[False])
neg precision: 0.871
>>> print 'neg recall: %2.3f' % nltk.metrics.recall(refsets[False], testsets[False])
neg recall: 0.945
>>> print 'neg F-measure: %2.3f' % nltk.metrics.f_measure(refsets[False], testsets[False])
neg F-measure: 0.906
and now the confusion matrix:
observed = []
actual = []
for i, (feats, label) in enumerate(test_set):
    actual.append(label)
    observed.append(classifier.classify(feats))
print nltk.ConfusionMatrix(actual, observed)
      |   F     |
      |   a   T |
      |   l   r |
      |   s   u |
      |   e   e |
------+---------+
False |<121>  7 |
 True |  18<137>|
------+---------+
(row = reference; col = test)
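As a sanity check, the precision and recall numbers above can be read straight off this matrix: the predicted-positive column holds 137 true positives and 7 false positives, and 18 actual positives were predicted negative:

tp, fp, fn = 137, 7, 18
print 'pos precision: %2.3f' % (tp / float(tp + fp))   # 0.951
print 'pos recall: %2.3f' % (tp / float(tp + fn))      # 0.884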
These are all great numbers.
Now we can use it:
>>> classifier.classify(extract_features("that movie was awful"))
False
>>> classifier.classify(extract_features("that movie was great"))
True
>>> classifier.show_most_informative_features(16)
Most Informative Features
contains(awesome) = True True : False = 37.8 : 1.0
contains(sucked) = True False : True = 31.0 : 1.0
contains(hate) = True False : True = 24.0 : 1.0
contains(love) = True True : False = 17.8 : 1.0
contains(heard) = True False : True = 10.1 : 1.0
contains(kinda) = True False : True = 8.4 : 1.0
contains(it,) = True False : True = 7.6 : 1.0
contains(want) = True True : False = 6.9 : 1.0
contains(evil) = True False : True = 5.6 : 1.0
contains(those) = True False : True = 5.2 : 1.0
contains(didn't) = True False : True = 5.2 : 1.0
contains(has) = True False : True = 5.2 : 1.0
contains(think) = True False : True = 4.7 : 1.0
contains(miss) = True True : False = 4.7 : 1.0
contains(watch) = True False : True = 4.6 : 1.0
contains(liked) = True True : False = 4.5 : 1.0
These terms make a lot of sense: awesome, sucked, hate, love...
We trained our classifier to learn keywords that have particularly positive or negative associations (the most informative features). Once it has learned that list, it is easy to apply it to a new input document. There are pre-built word lists of this kind available, such as AFINN, and libraries such as pattern that ship with such lexicons. With these, running sentiment analysis is trivial.
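To give a flavour of how such a word list is applied (a minimal sketch; the tiny lexicon below is invented for illustration, whereas the real AFINN file maps thousands of words to valences between -5 and +5), scoring a document is just a lookup and a sum:

# toy AFINN-style lexicon; the words and valences here are illustrative only
valence = {'awesome': 4, 'love': 3, 'great': 3, 'sucked': -3, 'hate': -3, 'awful': -3}

def lexicon_score(document):
    return sum(valence.get(w, 0) for w in document.lower().split())

print lexicon_score("that movie was awesome")   # 4
print lexicon_score("that movie sucked")        # -3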
In pattern, "the sentiment() function returns a (polarity, subjectivity)-tuple for the given sentence (based on the adjectives in it), with polarity between -1.0 and 1.0 and subjectivity between 0.0 and 1.0".
Let's fire up a sentiment analyzer and use it:
>>> from pattern.en import sentiment
>>> print sentiment("that movie was awesome")
(1.0, 1.0)
>>> print sentiment("that movie sucked")
(0, 0)
>>> print sentiment("that movie was ok")
(0.5, 0.5)
>>> print sentiment("that movie was fairly good")
(0.6, 0.8500000000000001)
That's all there is to it.
You can see that returning a continuous polarity value in (-1, 1) means that we could easily define neutral documents as those with an intermediate value, say in (-0.1, 0.1).
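For example, a minimal sketch of such a thresholding rule (the 0.1 cut-off is arbitrary) could look like this:

from pattern.en import sentiment

def label(document, threshold=0.1):
    polarity = sentiment(document)[0]
    if polarity > threshold:
        return 'pos'
    if polarity < -threshold:
        return 'neg'
    return 'neu'

print label("that movie was awesome")        # 'pos'
print label("I saw the movie at the Odeon")  # most likely 'neu'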
Let's return to the issue mentioned earlier: this is just a bag-of-words model, so it cannot distinguish "not good" from "good". One thing we can do is include bigrams in our features, that is, adjacent pairs of words:
>>> from nltk import bigrams
>>> bigrams("That movie was awful".split())
[('That', 'movie'), ('movie', 'was'), ('was', 'awful')]
Let's add bigrams to our extract features function:
def extract_features(document):
    features = {}
    for word in set(document.split()):
        if len(word) > 2:
            features['contains(%s)' % word.lower()] = True
    # bigram features, e.g. contains(not_good)
    for bigram in bigrams(document.split()):
        features['contains(%s)' % "_".join(i.lower() for i in bigram)] = True
    return features
Retraining our classifier with this, we get 93.6% accuracy and a more interesting feature list: I hate, I love, I think, love the...
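For reference, the retraining is just a re-run of the earlier steps with the new extract_features (your exact accuracy may differ if you change the seed or the split):

training_set = nltk.classify.apply_features(extract_features, documents[:n_train])
test_set = nltk.classify.apply_features(extract_features, documents[n_train:])
classifier = nltk.NaiveBayesClassifier.train(training_set)
print nltk.classify.accuracy(classifier, test_set)   # ~0.936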
>>> classifier.show_most_informative_features(16)
Most Informative Features
contains(awesome) = True True : False = 37.8 : 1.0
contains(sucked) = True False : True = 31.0 : 1.0
contains(hate) = True False : True = 24.0 : 1.0
contains(love_the) = True True : False = 22.4 : 1.0
contains(i_hate) = True False : True = 22.2 : 1.0
contains(i_love) = True True : False = 22.1 : 1.0
contains(love) = True True : False = 17.8 : 1.0
contains(code_was) = True True : False = 15.7 : 1.0
contains(i_think) = True False : True = 14.9 : 1.0
contains(3_was) = True True : False = 13.0 : 1.0
contains(i_like) = True True : False = 11.9 : 1.0
contains(like_harry) = True True : False = 10.2 : 1.0
contains(heard) = True False : True = 10.1 : 1.0
contains(i_heard) = True False : True = 9.3 : 1.0
contains(the_mission) = True True : False = 8.6 : 1.0
contains(kinda) = True False : True = 8.4 : 1.0
Hopefully you can see that you can get some good sentiment results with just a few tens of lines of Python, if you have the right libraries.
Here is the source code: p-value.info github