University of San Diego, M.S. Applied Data Science
Christopher Robinson, Jose Luis Estrada, Leonid Shpaner
GitHub Repository
https://github.com/lshpaner/twitter_emotions
This notebook implements a text-mining sentiment analysis project.
First, we fetch the data from Google Drive.
Data Source:
http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
The zip file contains data in CSV format with emoticons removed. The data has six fields:
0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
1 - the id of the tweet (2087)
2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
3 - the query (lyx). If there is no query, then this value is NO_QUERY.
4 - the user that tweeted (robotickilldozr)
5 - the text of the tweet (Lyx is cool)
%matplotlib inline
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF, TruncatedSVD, LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
import string
from string import punctuation
import nltk
import re
import subprocess
from wordcloud import WordCloud
from collections import Counter, defaultdict
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words("english")
pd.set_option('display.max_rows', None)
# Pull the data from Google Drive (originally hosted on the Stanford Engineering Computer Science Department's website)
def download_gdrive(file_id, print_stdout=True):
    command = 'gdown https://drive.google.com/uc?id={}'.format(file_id)
    returned_value = subprocess.run(command, shell=True, stdout=subprocess.PIPE,
                                    stderr=subprocess.STDOUT)
    if print_stdout:
        print(returned_value.stdout.decode("utf-8"))
    else:
        print("Download Complete")

download_gdrive("10rDgl5zAvUdVgSoVngHfwJnf8I1tdpZi", print_stdout=True)
download_gdrive("10qeDcgwdJC76Nv5cCj6WsUYjD6846fEL", print_stdout=True)
#Read train and test sets
columns = ['polarity', 'tweetid', 'date', 'query_name', 'user', 'text']
dftrain = pd.read_csv('train_data.csv',
header = None,
encoding ='ISO-8859-1')
dftest = pd.read_csv('test_data.csv',
header = None,
encoding ='ISO-8859-1')
dftrain.columns = columns
dftest.columns = columns
#Sample 1M entries from the training set
dftrain = dftrain.sample(1000000)
# Remove punctuation so only the words themselves are analyzed
def remove_punctuations(text):
    for punct in string.punctuation:
        text = text.replace(punct, '')
    return text
# Replace URLs and @usernames with the tokens URL and USERNAME, and collapse repeated characters
class PrePreprocess(object):
    user_pat = r'(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9]+)'
    http_pat = r'(https?:\/\/(?:www\.|(?!www))[^\s\.]+\.[^\s]{2,}|www\.[^\s]+\.[^\s]{2,})'
    repeat_pat, repeat_repl = r"(.)\1\1+", r'\1\1'
    def __init__(self):
        pass
    def transform(self, X):
        is_pd_series = isinstance(X, pd.Series)
        if not is_pd_series:
            pp_text = pd.Series(X)
        else:
            pp_text = X
        pp_text = pp_text.str.replace(pat=self.user_pat, repl='USERNAME', regex=True)
        pp_text = pp_text.str.replace(pat=self.http_pat, repl='URL', regex=True)
        # collapse three or more repeated characters down to two (e.g. "sooooo" -> "soo")
        pp_text = pp_text.str.replace(pat=self.repeat_pat, repl=self.repeat_repl, regex=True)
        return pp_text
    def fit(self, X, y=None):
        return self
#Descriptive statistics with function that analyzes number of tokens
def descriptive_stats(tokens, top_num_tokens = 5, verbose=True) :
"""
Given a list of tokens, print number of tokens, number of unique tokens,
number of characters, lexical diversity
(https://en.wikipedia.org/wiki/Lexical_diversity),
and num_tokens most common tokens. Return a list with the number of
tokens, number of unique tokens, lexical diversity, and number of
characters.
"""
# Fill in the correct values here.
num_tokens = len(tokens)
num_unique_tokens = len(set(tokens))
lexical_diversity = num_unique_tokens/num_tokens
num_characters = len("".join(tokens))
if verbose :
print(f"There are {num_tokens} tokens in the data.")
print(f"There are {num_unique_tokens} unique tokens in the data.")
print(f"There are {num_characters} characters in the data.")
print(f"The lexical diversity is {lexical_diversity:.3f} in the data.")
# print the five most common tokens
index = pd.Index(tokens)
index.value_counts()
df = pd.DataFrame(index.value_counts())
top5 = df.head(top_num_tokens)
print(top5.index.tolist())
return([num_tokens, num_unique_tokens,
lexical_diversity,
num_characters])
def count_words(df, column='tokens', preprocess=None, min_freq=1):
# process tokens and update counter
def update(doc):
tokens = doc if preprocess is None else preprocess(doc)
counter.update(tokens)
# create counter and run through all data
counter = Counter()
df[column].map(update)
# transform counter into data frame
freq_df = pd.DataFrame.from_dict(counter, orient='index', columns=['freq'])
freq_df = freq_df.query('freq >= @min_freq')
freq_df.index.name = 'token'
return freq_df#.sort_values('freq', ascending=False)
def display_topics(model, features, no_top_words=5):
for topic, words in enumerate(model.components_):
total = words.sum()
largest = words.argsort()[::-1] # invert sort order
print("\nTopic %02d" % topic)
for i in range(0, no_top_words):
print(" %s (%2.2f)" % (features[largest[i]],
abs(words[largest[i]]*100.0/total)))
def wordcloud(word_freq, title=None, max_words=200, stopwords=None):
    wc = WordCloud(width=800, height=400,
                   background_color="black", colormap="Paired",
                   max_font_size=150, max_words=max_words)
    # convert a pandas Series of frequencies into a plain dict
    if isinstance(word_freq, pd.Series):
        counter = Counter(word_freq.fillna(0).to_dict())
    else:
        counter = word_freq
    # drop any supplied stopwords before generating the cloud
    if stopwords is not None:
        counter = {token: freq for (token, freq) in counter.items()
                   if token not in stopwords}
    wc.generate_from_frequencies(counter)
    plt.title(title)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
def wordcloud_clusters(model, vectors, features, no_top_words=5):
for cluster in np.unique(model.labels_):
size = {}
words = vectors[model.labels_ == cluster].sum(axis=0).A[0]
largest = words.argsort()[::-1] # invert sort order
for i in range(0, no_top_words):
size[features[largest[i]]] = abs(words[largest[i]])
wc = WordCloud(background_color="white", max_words=100,
width=960, height=540)
wc.generate_from_frequencies(size)
plt.figure(figsize=(12,12))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
# Remove Stopwords
dftrain['text'] = dftrain['text'].apply(lambda x: ' '.join([word for word \
in x.split() if word not in (stop)]))
p = PrePreprocess()
# Pre-processing steps
dftrain['tokens'] = p.transform(dftrain['text'])
dftrain['tokens'] = dftrain['tokens'].apply(remove_punctuations)
##word_tokenize
dftrain['tokens'] = dftrain.apply(lambda row: nltk.word_tokenize(row['tokens']),
axis=1)
print("DESCRIPTIVE STATS ON Tokens: ")
all = []
#on 100k data
#for li in dftrain['text'].sample(100).iteritems(): all += li[1]
#on all data
for li in dftrain['tokens'].iteritems(): all += li[1]
descriptive_stats(all, verbose=True)
print("\n")
print("DESCRIPTIVE STATS ON SENTIMENT POLARITY:")
dftrain['polarity'].describe()
dftrain.head(10)
#Identify positive and negative tweets based on polarity
#Count frequency of tokens used in each dataset
cv = CountVectorizer()
cv.fit(dftrain.text)
neg_doc_matrix = cv.transform(dftrain[dftrain.polarity == 0].text)
pos_doc_matrix = cv.transform(dftrain[dftrain.polarity == 4].text)
neg_tf = np.sum(neg_doc_matrix,axis=0)
pos_tf = np.sum(pos_doc_matrix,axis=0)
neg = np.squeeze(np.asarray(neg_tf))
pos = np.squeeze(np.asarray(pos_tf))
term_freq_df = pd.DataFrame([neg, pos],
                            columns=cv.get_feature_names_out()
                            ).transpose()
term_freq_df.columns = ['negative', 'positive']
term_freq_df['total'] = term_freq_df['negative'] + term_freq_df['positive']
term_freq_df.sort_values(by='total',
ascending=False
).iloc[:10]
#Show top 50 negative tokens in tweets
y_pos = np.arange(50)
plt.figure(figsize=(12,10))
plt.bar(y_pos,
term_freq_df.sort_values(by='negative',ascending=False)
['negative'][:50],
align='center',
alpha=0.5)
plt.xticks(y_pos,
term_freq_df.sort_values(by='negative',ascending=False)
['negative']
[:50].index,
rotation='vertical')
plt.ylabel('Frequency')
plt.title('Top 50 tokens in Negative Tweets')
plt.xlabel('Top 50 Negative Tokens')
plt.show()
neg_word_freq = count_words(dftrain[dftrain["polarity"] == 0])
wordcloud(neg_word_freq['freq'], max_words=100,
stopwords=neg_word_freq.head(50).index)
#Show top 50 positive tokens in tweets
y_pos = np.arange(50)
plt.figure(figsize=(12,10))
plt.bar(y_pos,
term_freq_df.sort_values(by='positive',ascending=False)
['positive'][:50],
align='center',
alpha=0.5)
plt.xticks(y_pos,
term_freq_df.sort_values(by='positive',ascending=False)
['positive']
[:50].index,
rotation='vertical')
plt.ylabel('Frequency')
plt.xlabel('Top 50 Positive Tokens')
plt.title('Top 50 tokens in Positive tweets')
plt.show()
pos_word_freq = count_words(dftrain[dftrain["polarity"] == 4])
wordcloud(pos_word_freq['freq'], max_words=100,
stopwords=pos_word_freq.head(50).index)
# Train the model with a logistic regression pipeline
sentiment_lr = Pipeline([('pre_processor', p),
('cvect', CountVectorizer(min_df = 50)),
("scaler", preprocessing.StandardScaler(with_mean=False)),
('lr', LogisticRegression())])
sentiment_lr.fit(dftrain.text, dftrain.polarity)
# Evaluate on the full test set; restrict labels to the classes the model predicts (the test set also contains neutral = 2)
Xtest, ytest = dftest.text, dftest.polarity
print(classification_report(ytest,sentiment_lr.predict(Xtest), labels=np.unique(sentiment_lr.predict(Xtest))))
# Final model: combine the sentiment pipeline's score with a dangerous-tag flag
class FinalModel:
    def __init__(self, sentiment_model, dangerous_tags):
        self.model = sentiment_model   # the fitted logistic regression pipeline
        self.tags = dangerous_tags     # set of words/phrases considered dangerous
    def predict(self, text):
        out = self.model.predict_proba(text)
        print(out)
        # map the class probabilities to -1 (negative), 1 (positive) or 0 (neutral)
        score = 0
        if out[0][0] > 0.60:
            score = -1
        elif out[0][1] > 0.60:
            score = 1
        # inspect which vectorizer features the text activates (not used in the returned score)
        tok = self.model.named_steps['cvect'].transform([text])
        word_list = self.model.named_steps['cvect'].get_feature_names_out()
        count_list = tok.toarray().sum(axis=0)
        features_used = dict(zip(word_list, count_list))
        # flag the tweet if any token matches a dangerous tag
        # (note: the match is case-sensitive and token-level, so capitalized or
        # multi-word tags in the set will not be matched)
        flag = False
        print(text)
        for word in text.split():
            if word in self.tags:
                flag = True
        return (score, flag)
dangerous_tags = {"kill","killed", "shoot", "attack", "hurt","gun","guns", \
"weapon", "die", "bleed", "suicide","shooting","rifle", "choke", \
"punch","massacre","shooting","pain","revenge","bomb", \
"destroy","Stick","Knife","Blade","Club","Ax","Sword",\
"Spear","Halberd","Pike","Lance","Revolver","Rifle", \
"Shotgun","Semi Automatic Gun","Fully Automatic Gun",\
"Machine Gun","Crossbow","Flamethrower","Grenade",\
"Nerve Gas","Mustard Gas","Tear Gas","Pepper Spray", "AR15","AR-15"}
model = FinalModel(sentiment_model=sentiment_lr, dangerous_tags=dangerous_tags)
text = "Hello big beautiful world"
print(text)
pred = model.predict(text)
print("Sentiment score: {} \t is it dangerous: {}".format(pred, \
True if pred[0]==-1 and pred[1]==True else False))
print("\n")
text = "I hate this stupid world!!"
print(text)
pred = model.predict(text)
print("Sentiment score: {} \t is it dangerous: {}".format(pred, \
True if pred[0]==-1 and pred[1]==True else False))
print("\n")
text = "I will hurt you tomorrow"
print(text)
pred = model.predict(text)
print("Sentiment score: {} \t is it dangerous: {}".format(pred, \
True if pred[0]==-1 and pred[1]==True else False))
print("\n")
text = "That movie killed me! It was great"
print(text)
pred = model.predict(text)
print("Sentiment score: {} \t is it dangerous: {}".format(pred, \
True if pred[0]==-1 and pred[1]==True else False))
print("\n")
text = "Im going to punch the stupid teacher tomorrow!"
print(text)
pred = model.predict(text)
print("Sentiment score: {} \t is it dangerous: {}".format(pred, \
True if pred[0]==-1 and pred[1]==True else False))
print("\n")
text = "I like fruit punch" #testing bad spelling
print(text)
pred = model.predict(text)
print("Sentiment score: {} \t is it dangerous: {}".format(pred, \
True if pred[0]==-1 and pred[1]==True else False))
print("\n")
text = "i enjoy to kill zombies on my playstation!" #testing bad spelling
print(text)
pred = model.predict(text)
print("Sentiment score: {} \t is it dangerous: {}".format(pred, \
True if pred[0]==-1 and pred[1]==True else False))
print("\n")
text = "Just got out of shooting practice" #testing bad spelling
print(text)
pred = model.predict(text)
print("Sentiment score: {} \t is it dangerous: {}".format(pred, \
True if pred[0]==-1 and pred[1]==True else False))
print("\n")
text = "Im bringing my AR15 to hurt everyone" #testing bad spelling
print(text)
pred = model.predict(text)
print("Sentiment score: {} \t is it dangerous: {}".format(pred, \
True if pred[0]==-1 and pred[1]==True else False))
print("\n")
text = "I like shooting my AR-15 after school" #testing bad spelling
print(text)
pred = model.predict(text)
print("Sentiment score: {} \t is it dangerous: {}".format(pred, \
True if pred[0]==-1 and pred[1]==True else False))
print("\n")
text = "I am going to shoot you" #testing bad spelling
print(text)
pred = model.predict(text)
print("Sentiment score: {} \t is it dangerous: {}".format(pred, \
True if pred[0]==-1 and pred[1]==True else False))
print("\n")
text = "I was going shooting tommorow but I hurn my hand." #testing bad spelling
print(text)
pred = model.predict(text)
print("Sentiment score: {} \t is it dangerous: {}".format(pred, \
True if pred[0]==-1 and pred[1]==True else False))
print("\n")
The results of the classification model were surprisingly good. We decided to go with a simpler approach to detecting dangerous tweets, and in the end we thought it worked well for our purposes. When we looked at our test tweets, the model seemed to do a good job of deciding which tweets we could consider dangerous and which we could not. It performed well when a tweet involved just one topic, but in testing we were able to confuse it with tweets that mixed multiple topics, where dangerous words appeared even though the statements they belonged to were unrelated. For example, the tweet “I was going to go shooting tomorrow but I hurt my hand” was flagged as dangerous because it contains multiple dangerous words, even though they refer to different things. In this case the model effectively equated it with a statement such as “I am going to shoot you tomorrow and it will hurt”, which is obviously a very different statement.
In the classification report, the labels shown are 0 and 4, i.e. negative and positive. The dataset also includes the label 2 (neutral); because the threshold we selected for deciding whether a tweet is positive or negative was 0.6, we effectively converted this into a binary problem, with low-confidence predictions treated as neutral. For the scope of this project we wanted to know which tweets are negative and, from those, which are dangerous based on the tags we selected; this helps us surface some of the dangerous tags. A score of around 0.75 is relatively high for a text mining project of this magnitude.
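The snippet below is a minimal sketch (not part of the original run) of how this 0.6 threshold maps the model's two-class probabilities onto negative, neutral, and positive labels for the whole test set; it assumes the fitted sentiment_lr pipeline and dftest from above and mirrors the logic inside FinalModel.predict.
# Sketch: apply the 0.6 probability threshold from FinalModel to the full test set.
# Column order follows sentiment_lr.classes_, i.e. column 0 = class 0 (negative),
# column 1 = class 4 (positive).
proba = sentiment_lr.predict_proba(dftest.text)
labels = np.select(
    [proba[:, 0] > 0.60, proba[:, 1] > 0.60],   # confident negative / confident positive
    [-1, 1],
    default=0                                   # otherwise treated as neutral
)
print(pd.Series(labels).value_counts())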
Tuning hyperparameters and exploring different models could improve predictions of whether tweets are positive or negative.
Our logistic regression did not have any hyperparameters tuned, but even so the negative class performed better than we initially expected. Precision measures how often our positive predictions were correct; here the positive class did not perform as well, but for the scope of this project that is not a major concern. Recall tells us how many of the actual positives were predicted positive; both the negative and positive classes gave acceptable results. Finally, the F1 scores (the harmonic mean of precision and recall) could also be improved by tuning the hyperparameters.
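As a sketch of what that tuning might look like (the parameter values below are illustrative assumptions, not something we ran), a small grid search over the vectorizer's min_df and the logistic regression's C could be run on a pre-processed subsample of the training data:
from sklearn.model_selection import GridSearchCV

# tune on a pre-processed subsample to keep the search tractable
sample = dftrain.sample(100000, random_state=42)
X_tune = p.transform(sample.text)

tune_pipe = Pipeline([('cvect', CountVectorizer(min_df=50)),
                      ('scaler', preprocessing.StandardScaler(with_mean=False)),
                      ('lr', LogisticRegression())])
param_grid = {'cvect__min_df': [10, 50, 100],   # illustrative values
              'lr__C': [0.1, 1.0, 10.0]}
grid = GridSearchCV(tune_pipe, param_grid, cv=3, scoring='f1_weighted', n_jobs=-1)
grid.fit(X_tune, sample.polarity)
print(grid.best_params_, grid.best_score_)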
# Taking a sample from the train data for the topic modelling
Topic_Data = dftrain.sample(10000)
# TF-IDF vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_text_vectorizer = TfidfVectorizer(stop_words=stop, min_df=5, max_df=0.7)
tfidf_text_vectors = tfidf_text_vectorizer.fit_transform(Topic_Data['text'])
tfidf_text_vectors.shape
k_means_text = KMeans(n_clusters=5, random_state=42)
model = k_means_text.fit(tfidf_text_vectors)
np.unique(model.labels_, return_counts=True)
sizes = []
for i in range(5):
sizes.append({"cluster": i, "size": np.sum(model.labels_==i)})
pd.DataFrame(sizes).set_index("cluster").plot.bar(figsize=(16,9))
plt.xticks(rotation=0)
plt.title('Cluster by Size')
plt.ylabel('size')
plt.show()
wordcloud_clusters(model, tfidf_text_vectors,
                   tfidf_text_vectorizer.get_feature_names_out())
# Fitting an LSA Model
svd_text_model = TruncatedSVD(n_components = 5, random_state=42)
W_svd_text_matrix = svd_text_model.fit_transform(tfidf_text_vectors)
H_svd_text_matrix = svd_text_model.components_
# call display_topics
display_topics(svd_text_model, tfidf_text_vectorizer.get_feature_names_out())
The objective of LSA is to reduce the overall dimensionality for classification. Initially, we did not feel that topic modelling was the best approach for identifying dangerous tweets: unlike documents, tweets are short, largely unstructured texts focused on a single subject, and that subject is hard to recover without first grouping tweets and responses together into one document. In our dataset the relationships between tweets are unknown, so any topic structure is likely lost. The results of the topic model were as expected, with no real discernible topics emerging. For example, our first topic has the keywords good, day, http (meaning a link was originally present), work, and go, so in reality we are essentially just seeing summaries of the tweets. This was expected: as noted above, tweets are not really documents, and they are very short and narrowly focused, so this type of topic modelling does not work well here. In hindsight, because LSA focuses on dimensionality reduction, a different model such as Latent Dirichlet Allocation (LDA) might have been a better choice, although we suspect the outcome would have been similar.
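As a rough sketch of that alternative (not something we ran as part of this project), LDA could be fitted on raw term counts over the same 10,000-tweet sample and inspected with the display_topics helper defined above; the parameter choices here are illustrative only.
# Sketch: LDA on raw counts (LDA expects count data rather than TF-IDF weights)
count_text_vectorizer = CountVectorizer(stop_words=stop, min_df=5, max_df=0.7)
count_text_vectors = count_text_vectorizer.fit_transform(Topic_Data['text'])

lda_text_model = LatentDirichletAllocation(n_components=5, random_state=42)
W_lda_text_matrix = lda_text_model.fit_transform(count_text_vectors)

display_topics(lda_text_model, count_text_vectorizer.get_feature_names_out())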
Since we used sentiment analysis in conjunction with tags we deemed dangerous to flag specific posts, the most obvious next step would be to add more dangerous tags to the model. We added tags we felt were dangerous, but given more time we could have done a more thorough analysis of tweets to compile dangerous tags; there are surely many more out there, many of which we may not be familiar with. Additionally, because we focused specifically on violence, we could expand the model to include hate speech or other more subtle forms of speech that may not be a direct threat of violence but could constitute bullying or intimidation.
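One possible sketch of that extension (the added tags below are purely illustrative, not a vetted list) is to merge new tags into the existing set and match them case-insensitively, with multi-word entries matched as phrases so that tags such as "Machine Gun" can actually be hit:
# Illustrative only: extend the tag set and match it case-insensitively
extra_tags = {"stab", "strangle", "threaten"}            # hypothetical additions
all_tags = {t.lower() for t in dangerous_tags} | extra_tags

def contains_dangerous_tag(text, tags=all_tags):
    lowered = text.lower()
    tokens = set(lowered.split())
    # single-word tags are matched against tokens, multi-word tags as phrases
    return any(tag in tokens if " " not in tag else tag in lowered
               for tag in tags)

print(contains_dangerous_tag("He brought a Machine Gun"))   # True
print(contains_dangerous_tag("Hello big beautiful world"))  # False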
Another enhancement would be further tuning of the model hyperparameters, or building different models and choosing among them based on performance. Due to time constraints we built a simple but effective model; given more time we could expand the current model or explore more complex ones.
Lastly, as a future enhancement we could add functionality to act on dangerous tweets. For example, if the model determines a series of tweets to be dangerous, the program could suspend the user's account and/or notify the authorities. At that point, however, the model would have to be well tested and very accurate at identifying dangerous tweets, because we would not want an automated process running rampant on Twitter, shutting down people's accounts and calling the authorities on them because they happen to be talking about a hunting trip or accidentally cutting themselves with a knife while making dinner.