Classifying Emotions in Tweets

Leonid Shpaner, Jose Luis Estrada, and Christopher Robinson

Python - Jupyter Notebook

This notebook implements a text mining sentiment analysis project.

Data Source:

http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip

The zip file contains data in CSV format with emoticons removed. Data has 6 fields:

0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
1 - the id of the tweet (2087)
2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
3 - the query (lyx). If there is no query, then this value is NO_QUERY.
4 - the user that tweeted (robotickilldozr)
5 - the text of the tweet (Lyx is cool)
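Below is a minimal sketch of loading this file with pandas. The file name and Latin-1 encoding are assumptions based on the standard Sentiment140 distribution, and the column names are taken from the field list above (the CSV itself has no header row).

```python
import pandas as pd

# Column names follow the six fields listed above (the CSV has no header row)
cols = ["polarity", "id", "date", "query", "user", "text"]

# File name and Latin-1 encoding assumed from the Sentiment140 zip
df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                 encoding="latin-1", header=None, names=cols)

print(df["polarity"].value_counts())
```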

Logistic Regression Results

The results of the classification model were surprisingly good. We decided to go with a simpler approach to detecting dangerous tweets, and in the end, we thought it worked well for our purposes. When we looked at our test tweets, the model seemed to do a good job of deciding which tweets we could consider dangerous and which we could not. It performed well when a tweet involved a single topic, but in testing we were able to confuse it with tweets that discussed multiple topics, where dangerous words appeared in unrelated statements. For example, the tweet “I was going to go shooting tomorrow but I hurt my hand” was flagged as dangerous because it contains multiple dangerous words, even though they refer to different things. In effect, the model equated it with a statement such as “I am going to shoot you tomorrow and it will hurt”, which is obviously a very different statement.
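To make that limitation concrete, here is a minimal sketch of the kind of keyword-plus-sentiment flagging described above; the DANGEROUS_TAGS list and the is_dangerous helper are hypothetical stand-ins, not the exact tags or code used in the notebook.

```python
import re

# Hypothetical list of dangerous tags; the actual tags used in the project differ
DANGEROUS_TAGS = {"shoot", "shooting", "kill", "hurt", "gun"}

def is_dangerous(text, predicted_negative):
    """Flag a tweet if it was scored negative AND contains a dangerous tag."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return predicted_negative and bool(tokens & DANGEROUS_TAGS)

# Both tweets trip the flag, illustrating the multi-topic confusion described above
print(is_dangerous("I was going to go shooting tomorrow but I hurt my hand", True))
print(is_dangerous("I am going to shoot you tomorrow and it will hurt", True))
```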

In the classification report, the class labels were 0 and 4: negative and positive. The dataset also includes the label 2 (neutral), but because we applied a 0.6 threshold for deciding whether a tweet is positive or negative, we converted the task to a binary problem, effectively treating tweets that did not clear the threshold as neutral. For the scope of this project, we wanted to know which tweets were negative and, from those, identify which were dangerous based on the tags we selected. A prediction score of 0.75 is relatively high for a text mining project of this magnitude.
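As a rough illustration of that thresholding, the sketch below maps the Sentiment140 polarity codes to binary labels and applies a 0.6 probability cutoff; the helper names are ours, and the notebook's actual preprocessing may differ.

```python
# Map Sentiment140 polarity codes (0 = negative, 4 = positive) to binary labels
def to_binary(polarity):
    return 1 if polarity == 4 else 0

# Apply the 0.6 threshold to a classifier's positive-class probability;
# scores falling between the two cutoffs are effectively neutral.
def label_with_threshold(proba_positive, threshold=0.6):
    if proba_positive >= threshold:
        return "positive"
    if proba_positive <= 1 - threshold:
        return "negative"
    return "neutral"

print(label_with_threshold(0.72))  # positive
print(label_with_threshold(0.55))  # neutral
print(label_with_threshold(0.30))  # negative
```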

Tuning hyperparameters and exploring different models can improve predictions on whether tweets are positive or negative.
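One way such tuning could be done (not what we actually ran) is a small grid search over a TF-IDF plus logistic regression pipeline; the grid values and toy tweets below are illustrative placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy placeholder data; in the project this would be the preprocessed tweets and labels
texts = ["love this great day", "awful terrible day", "great work today",
         "hate this so much", "wonderful happy news", "horrible sad news"]
labels = [1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid; our model used the default hyperparameters
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```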

Our logistic regression did not have any hyperparameters tuned, but even so, the negative tweets performed better than we initially expected. Precision tells us how many of our positive predictions were correct; in this case, the positive tweets did not perform that well, but for the scope of this project that is not a big concern. Recall tells us how many of the actual positives were predicted positive. Both negative and positive tweets produced acceptable results, and the F1 scores (the harmonic mean of precision and recall) could also be improved by tuning the hyperparameters.
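These metrics can be read straight out of scikit-learn's classification report; the labels below are toy values just to show the call, not our actual predictions.

```python
from sklearn.metrics import classification_report

# Toy labels using the Sentiment140 codes (0 = negative, 4 = positive); illustrative only
y_true = [0, 0, 4, 4, 0, 4, 0, 4]
y_pred = [0, 0, 4, 0, 0, 4, 4, 4]

# Precision: of the tweets predicted positive, how many were truly positive.
# Recall: of the truly positive tweets, how many were predicted positive.
# F1: the harmonic mean of precision and recall.
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
```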

Topic Modeling

The objective of LSA is to reduce the overall dimensionality for classification. Initially, we did not feel that topic modeling was the best approach for trying to identify dangerous tweets, because tweets are not like documents: they are short, usually unstructured, and focused on one specific topic that is hard to derive without first grouping texts and responses together into one document. In our dataset, the relationship between texts is unknown, and any topic pattern is likely lost. The results of the topic model were as expected, with no real discernible topics emerging. For example, our first topic had the keywords good, day, http (which means there was originally a link there), work, and go, so in reality we are essentially just seeing summaries of the tweets. This was expected because, as mentioned before, tweets are not really documents; they are very short and narrowly focused, so this type of topic modeling does not work well here. In hindsight, because LSA focuses on dimensionality reduction, a different model such as Latent Dirichlet Allocation (LDA) might have been a better choice, although we feel the outcome would have been similar.
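For reference, here is a minimal LSA sketch in scikit-learn (TF-IDF features reduced with truncated SVD) of the sort described above; the sample tweets are placeholders and the number of components is arbitrary.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder tweets; the notebook runs this on the full tweet corpus
tweets = [
    "good day at work http link",
    "going to the gym then back to work",
    "had a bad day today",
    "love this sunny day so much",
    "work was long but good overall",
]

# LSA: TF-IDF features followed by truncated SVD for dimensionality reduction
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(tweets)
svd = TruncatedSVD(n_components=2, random_state=42)
svd.fit(X)

# Show the top terms loading on each latent "topic"
terms = vectorizer.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top = [terms[j] for j in component.argsort()[::-1][:5]]
    print(f"Topic {i}: {top}")
```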

Next Steps

Since we used sentiment analysis in conjunction with tags we deemed dangerous to flag specific posts, we believe the most obvious next step would be to add more dangerous tags to the model. We added tags we felt were dangerous, but given more time we could have done a more thorough analysis of tweets to compile dangerous tags, since there are surely many more out there, many of which we may not be familiar with. Additionally, because we were focused specifically on violence, we could expand the model to include hate speech or other more subtle forms of speech that may not necessarily be identified as a direct threat of violence but could be forms of bullying or intimidation.

Another enhancement would be further tuning of the model hyperparameters or building different models based on performance. Due to time constraints, we built a simple but effective model; given more time, we could expand the current model or explore more complex models.

Lastly, as a future enhancement, we could add functionality to act on dangerous tweets as well. For example, if the model determines a series of tweets to be dangerous, the program could then suspend the user's account and/or notify authorities. At that point, however, the model would have to be well tested and very good at identifying dangerous tweets, because we do not want an automated process running rampant on Twitter, shutting down people's accounts and calling the authorities on them because they happen to be talking about a hunting trip or accidentally cutting themselves with a knife while making dinner.