How did I build a real-time Twitter sentiment analyser?

Last week was pretty much about my hobby project – this Twitter Sentiment Analyzer

So, how did I build this semi-magical thing over a weekend?

Here goes the answer..

First, let’s look at the domain – data

The very first problem was to access tweets, filtered by a search query.

Well, Twitter provides APIs for that!
The first one is the REST API, which gives you a bundle of tweets (max 100 per request), with some rate limiting – we can’t send these requests too often.
The second one is Streaming API, now, this one is interesting!
It’ll provide a real-time stream of tweets to the client! It’s a ‘hot’ stream that can deliver lots of tweets, if handled correctly.

Now, I wanted to learn something new – I was getting bored with good old Laravel – so I looked into it, and it’s NodeJS now!
NodeJS was the perfect fit for this application, for its performance in real-time web apps and the NLP tools available (although Python has the most NLP tools, NodeJS has ports of many of them).

And oh, just Node? That seems… Incomplete..
Framework? Frameworks? … …..
Ah, It’s SailsJS!
A mature MVC framework for NodeJS!
It follows convention-over-configuration, so we can apply everything we know about web app development, the NodeJS way!

Now, what about the data storage?
Umm.. MySQL.. Postgres.. maybe…. MongoDB……
Ah, it’s MongoDB!
It fits perfectly into this JavaScript full-stack ecosystem – it’s fast, scalable, and has the least impedance mismatch with JavaScript!

Now for the frontend – Good-old AngularJS, scaffolded with Yo!

And I got started!

First I built a prototype to test whether I could fetch, store, and retrieve tweets using NodeJS’s twitter package (which consumes the Twitter API) and Mongoose (an ODM for MongoDB).
Yay, it was successful, now I can move this to the framework, and begin on the real thing.

So, just with Sails generators, I quickly created a Tweets model (SailsJS uses Waterline as its ORM) and a TweetController, and it laid out basic CRUD with a REST API for me!
Then, using the twitter package, I created a TwitterService to fetch some tweets via the Twitter REST API and store them in MongoDB.

Now, the real part comes into picture, sentiment analysis!
Now, I had a lot of options for it – there are many NodeJS packages, like speakeasy, sentiment, and sentimental.
After going through the performance and the APIs provided by each of them, I chose sentiment.

Now, if I give sentiment any statement to process, it gives me the following output –

  • Score – if this score is positive, the sentiment of the statement is positive; the same goes for negative.
    If the score is zero, the statement is neutral.
    Internally, it’s AFINN based analysis. AFINN is a list of English words rated for valence with an integer between minus five (negative) and plus five (positive).
  • Tokens – the statement is tokenized, and this contains an array of ALL the words found in the statement.
  • Type of sentence – a score that indicates whether the statement is comparative.
  • Positive words, Negative words – arrays of positive and negative words found in the sentence.
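To make that output shape concrete, here’s a toy sketch of AFINN-style scoring. The word table is a tiny hand-picked sample (the real sentiment package ships the full AFINN list), and `analyze` is an illustrative name, not the package’s actual API.

```javascript
// Toy AFINN-style scorer. Each known word carries a valence between -5 and +5;
// the statement's score is the sum. This word table is a tiny hand-picked
// sample, NOT the real AFINN list.
const AFINN = { love: 3, great: 3, good: 2, bad: -3, hate: -3, awful: -3 };

function analyze(statement) {
  // Tokenize: lowercase, strip punctuation, split on whitespace.
  const tokens = statement
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, '')
    .split(/\s+/)
    .filter(Boolean);

  const positive = [];
  const negative = [];
  let score = 0;

  for (const word of tokens) {
    const valence = AFINN[word] || 0;
    score += valence;
    if (valence > 0) positive.push(word);
    if (valence < 0) negative.push(word);
  }

  // score > 0 → positive sentiment, score < 0 → negative, 0 → neutral
  return { score, tokens, positive, negative };
}

console.log(analyze('I love this great phone').score); // 6
console.log(analyze('this is awful').negative);        // ['awful']
```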

So, technically, the thing was done!
Is this sufficient?
I’m just showing the overall sentiment – where’s the interesting stuff? How can I, as an end user, connect the dots??

Now this got interesting!

Let’s start counting words then!
Let’s count everything!

And I counted ’em all!
And… shitty results!
Especially for the tokens! I counted the ‘top words’, that is, the most frequently occurring words – and most of them were ‘A’, ‘RT’, ‘http’ and blah blah..
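For reference, the naive counting looked roughly like this (the sample tweets are made up) – with no filtering at all, noise tokens like ‘RT’ and articles crowd out the interesting words:

```javascript
// Naive 'top words' counter: split each tweet on whitespace and tally
// every token, without any filtering.
function topWords(tweets, n) {
  const counts = {};
  for (const text of tweets) {
    for (const word of text.split(/\s+/)) {
      counts[word] = (counts[word] || 0) + 1;
    }
  }
  // Sort tokens by frequency, highest first, and keep the top n.
  return Object.entries(counts)
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
    .map(([word]) => word);
}

// Made-up sample tweets.
const sample = [
  'RT a great day http',
  'RT http a nice day',
  'a day RT http',
];
console.log(topWords(sample, 3)); // noise ('RT', 'a', …) dominates the top
```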

Now what?
Now, the problem is to count proper nouns, filtering them out from the other crap!
Part of speech tagging! Oh yes, that’s a real thing!
And I got the pos package, based on posjs (I considered Stanford’s simple NLP, but it was too difficult to fit into the ecosystem).
So I started filtering out all singular and plural proper nouns.
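The filtering step boils down to keeping only the NNP (singular proper noun) and NNPS (plural proper noun) tags of the Penn Treebank tag set. The `[word, tag]` pair format below mirrors typical POS tagger output, and the tagged input is hand-written for illustration:

```javascript
// Keep only proper nouns from tagged tokens: NNP = singular proper noun,
// NNPS = plural proper noun (Penn Treebank tags). The [word, tag] pair
// format mirrors what a POS tagger produces; this input is hand-written.
function properNouns(taggedWords) {
  return taggedWords
    .filter(([, tag]) => tag === 'NNP' || tag === 'NNPS')
    .map(([word]) => word);
}

const tagged = [
  ['RT', 'NN'],
  ['Apple', 'NNP'],
  ['released', 'VBD'],
  ['the', 'DT'],
  ['iPhones', 'NNPS'],
];
console.log(properNouns(tagged)); // ['Apple', 'iPhones']
```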

Now, a small problem,
As Twitter streams have words in natural language, people make spelling mistakes, or sometimes it’s just BrE vs AmE..
So, if I have two words like Cats and Catz, these two are treated differently, although the intent is the same!
Now the package natural came into picture, naturally!
It’s a comprehensive NLP toolkit for NodeJS.
So, I improved my algorithm by taking string distance (Jaro–Winkler distance) into consideration – an output of 1 means an exact match and 0 means no match – so if this distance is greater than 0.8, both words are considered the same.
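Here’s a self-contained sketch of that comparison. In the real app the natural package supplies the distance function; the implementation below is my own, written out so the 0.8 threshold is easy to see:

```javascript
// Jaro similarity: counts matching characters within a sliding window and
// penalizes transpositions. Returns 1 for identical strings, 0 for no match.
function jaro(s1, s2) {
  if (s1 === s2) return 1;
  const window = Math.max(Math.floor(Math.max(s1.length, s2.length) / 2) - 1, 0);
  const matched2 = new Array(s2.length).fill(false);
  const matches1 = [];
  for (let i = 0; i < s1.length; i++) {
    const lo = Math.max(0, i - window);
    const hi = Math.min(s2.length - 1, i + window);
    for (let j = lo; j <= hi; j++) {
      if (!matched2[j] && s1[i] === s2[j]) {
        matched2[j] = true;
        matches1.push(s1[i]);
        break;
      }
    }
  }
  const m = matches1.length;
  if (m === 0) return 0;
  const matches2 = [];
  for (let j = 0; j < s2.length; j++) {
    if (matched2[j]) matches2.push(s2[j]);
  }
  let transpositions = 0;
  for (let k = 0; k < m; k++) {
    if (matches1[k] !== matches2[k]) transpositions++;
  }
  transpositions /= 2;
  return (m / s1.length + m / s2.length + (m - transpositions) / m) / 3;
}

// Winkler modification: boost the score for strings sharing a common prefix
// (up to 4 characters), with scaling factor p = 0.1.
function jaroWinkler(s1, s2, p = 0.1) {
  const j = jaro(s1, s2);
  let prefix = 0;
  while (prefix < 4 && prefix < s1.length && prefix < s2.length && s1[prefix] === s2[prefix]) {
    prefix++;
  }
  return j + prefix * p * (1 - j);
}

// Words with distance greater than 0.8 are treated as the same token.
const sameWord = (a, b) => jaroWinkler(a.toLowerCase(), b.toLowerCase()) > 0.8;

console.log(jaroWinkler('cats', 'catz').toFixed(2)); // 0.88
console.log(sameWord('Cats', 'Catz'));               // true
```

So Cats vs Catz lands at about 0.88 – above the 0.8 threshold – and the two are merged into one count.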

Umm.. something is missing…
What about the favourite counts? Should we just let that data go away? NO! Data is the oil of the 21st century, so I’m not letting anything get away!
Let’s make those count, too! Let’s multiply the favourite count of a tweet by a fraction (0.1) and add that to the tweet’s score!
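In code, that weighting is a one-liner (the field names on the tweet object are illustrative, not the real schema):

```javascript
// Fold favourite counts into the final score: each favourite nudges the
// tweet's sentiment score by a tenth of a point, so popular tweets weigh more.
function weightedScore(tweet) {
  return tweet.sentimentScore + 0.1 * tweet.favoriteCount;
}

console.log(weightedScore({ sentimentScore: 2, favoriteCount: 10 })); // 3
console.log(weightedScore({ sentimentScore: -1, favoriteCount: 5 })); // -0.5
```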

Ah it’s all set now!

It’s working, it’s fun..
But wait.. something seems missing..

Oh, I can make it real-time!!
Now the streaming API!
Using the same twitter package, I wrote code so that when a user sends a request, the REST API call is made AND the Twitter Streaming API is called – the stream is started, and incoming tweets are saved to MongoDB.

A separate process starts now, which processes all the tweets and returns the analysis result..

Now it is real time..
Or is it? Is it real time for the client??
Now what?

Good old friend!
A real-time data storage platform as a service
With first-class support for AngularJS – real-time data binding with its AngularFire library!

So my tweets processor just started saving the end results to Firebase!
I hooked up my Angular app with that storage, in a three-way binding fashion..
It’s all WebSockets!

Here we have it, a real-time Twitter sentiment analyser!!!