Through various circumstances, I ended up working on a text classification problem that asks, "Is the student's answer gold, silver, bronze, or out of range?"
In practice, however, I had not worked on many natural language processing tasks, so I decided to get some practice by tackling a sentiment analysis problem that looked interesting on Kaggle.
Through a lot of trial and error, I found that preprocessing (text cleaning) matters for natural language processing tasks (more than for other kinds of tasks), so I summarize it in this article.
Students and practitioners from all over the world compete on Kaggle, the online data analysis competition site, and the dataset "Sentiment140 dataset with 1.6 million tweets" is shared there, so I use that data.
This is not data from an actual competition, but a dataset shared by a Kaggle volunteer with the message "please use it for sentiment analysis." With 1.6 million tweets, it is a fairly rich dataset. [Dataset here]: https://www.kaggle.com/kazanova/sentiment140
The task is binary classification: given a text (a tweet), predict whether the tweet is positive or negative.
I adopted the approach of "fixing the model architecture and varying only the preprocessing." (In reality I tried many things.) In this article, I prioritize clarity and share the work briefly in the following flow.
The Vanilla LSTM is the most basic (classic) LSTM architecture: a single hidden LSTM layer followed by an output layer.
The code (Keras Sequential API) is as follows.
from keras.models import Sequential
from keras.layers import LSTM, Dense

# n_steps and n_features are placeholders for the sequence length and the feature dimension
model = Sequential()
model.add(LSTM(32, activation='tanh', input_shape=(n_steps, n_features)))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
[Reference: https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/]
Now, let me share the actual implementation.
from collections import defaultdict, Counter
import time
import re
import string
import pandas as pd
import nltk
from nltk.corpus import stopwords
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import one_hot
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Embedding,LSTM, Dense
from keras.callbacks import EarlyStopping
df = pd.read_csv("twitter_sentiment_kaggle.csv", encoding="latin-1", header=0,
                 names=['sentiment', 'id', 'date', 'flag', 'user', 'text'], usecols=["id", "sentiment", "text"])
print(df.head(2))
I don't think this part needs explanation, so let's move on.
What does it mean to clean a tweet? It means removing characters and symbols that interfere with learning.
For this task, the model has to read (infer) emotion from a tweet, so information such as "@username" and URLs just gets in the way.
In NLP tasks, this kind of work, deleting symbols and characters that you do not want the model to learn from, is generally called preprocessing.
The preprocessing I tried first is listed below. There is nothing special here; all of it is basic preprocessing.
**Preprocessing performed**
def clean_text(text_data):
    # Definitions of regular expressions, emoji encodings, stopwords and punctuation
    # (the patterns and the emoji dictionary are borrowed from ones shared online)
    URL_PATTERN = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
    EMOJI_ENCODER_DICT = {':)': 'smile', ':-)': 'smile', ':))': 'smile', ';d': 'wink', ':-E': 'vampire', ':(': 'sad',
                          ':-(': 'sad', ':-<': 'sad', ':P': 'raspberry', ':O': 'surprised',
                          ':-@': 'shocked', ':@': 'shocked', ':-$': 'confused', ':\\': 'annoyed', ':#': 'mute',
                          ':X': 'mute', ':^)': 'smile', ':-&': 'confused',
                          '$_$': 'greedy', '@@': 'eyeroll', ':-!': 'confused', ':-D': 'smile', ':-0': 'yell',
                          'O.o': 'confused', '<(-_-)>': 'robot', 'd[-_-]b': 'dj',
                          ":'-)": 'sadsmile', ';)': 'wink', ';-)': 'wink', 'O:-)': 'angel', 'O*-)': 'angel',
                          '(:-D': 'gossip', '=^.^=': 'cat'}
    USER_NAME_PATTERN = r'@[^\s]+'
    NON_ALPHA_PATTERN = r"[^A-Za-z0-9]"
    SEQUENCE_DETECT_PATTERN = r"(.)\1\1+"
    SEQUENCE_REPLACE_PATTERN = r"\1\1"
    ENGLISH_STOPWORDS = stopwords.words('english')
    PUNCTUATIONS = list(string.punctuation)  # list of punctuation characters (string.punctuation.split() would give a single-element list)

    tokenizer = nltk.TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

    ############################### Pre-process tweets ########################################
    clean_tweets = []
    for each_tweet in text_data:
        # Lowercase all letters
        each_tweet = each_tweet.lower()
        # Remove URLs
        each_tweet = re.sub(URL_PATTERN, "", each_tweet).strip()
        # Shorten runs of 3 or more identical characters to 2
        each_tweet = re.sub(SEQUENCE_DETECT_PATTERN, SEQUENCE_REPLACE_PATTERN, each_tweet)
        # Encode emojis as words
        for key in EMOJI_ENCODER_DICT.keys():
            each_tweet = each_tweet.replace(key, " EMOJI " + EMOJI_ENCODER_DICT[key])
        ### Remove symbols that are not needed for modeling ###
        # Remove "@username" mentions
        each_tweet = re.sub(USER_NAME_PATTERN, "", each_tweet)
        # Remove everything except letters and digits
        each_tweet = re.sub(NON_ALPHA_PATTERN, " ", each_tweet)
        ### Tokenize the tweet (into a list of words) and remove stopwords and punctuation ###
        tweet_tokens = tokenizer.tokenize(each_tweet)
        clean_tweet_sentence = ' '
        for word in tweet_tokens:  # look at each word
            if word not in ENGLISH_STOPWORDS and word not in PUNCTUATIONS:
                clean_tweet_sentence += (word + ' ')
        clean_tweets.append(clean_tweet_sentence)
    return clean_tweets
#########################################################################################
# Clean up the tweets
t = time.time()
clean_tweets_list = clean_text(df["text"])
print('The tweets have been cleaned.')
print(f'Code execution time: {round(time.time()-t)} seconds')
# Add the cleaned tweets as a new column, 'clean_text'
df["clean_text"] = clean_tweets_list
# View the results
print(df[["text", "clean_text"]].head(2))
After cleaning the tweets in STEP 1 above, we process them so the model can be trained on them. In other words, each word (English word) contained in a tweet is represented by a number. The model can only work with numbers, so this step is necessary.
The specific work done is summarized below.
**Hash (one-hot encode) each word contained in a tweet**
- Hashing here means assigning an index number to each word.
- For example, "I love LSTM" is converted to something like [100, 240, 600].
- This preprocessing is required so that the Embedding layer, described later, works correctly.
- The argument n required by the Keras one_hot function is the vocabulary size.
- This time, the vocabulary size n is the number of unique words in clean_tweet.
- For example, if clean_tweet contains 100,000 unique words, pass n = 100000.
- This vocabulary size n is usually set to the same value as input_dim, an argument of the Embedding layer.
- Documentation for the keras.preprocessing.text.one_hot function: https://keras.io/ja/preprocessing/text/

**Apply padding / truncation to the hashed tweets**
- Padding / truncating simply means unifying the number of elements in each list.
- For example, suppose the two tweets "I love LSTM" and "I prefer GRU over LSTM" are hashed to [100, 240, 600] and [100, 250, 900, 760, 600] respectively.
- In this case the numbers of elements (sequence lengths) do not match, which makes learning with an LSTM difficult.
- So, to align the number of elements to a specified length (for example, 5), padding / truncating is applied.
- Padding fills the gap with 0 when a sequence is shorter than the specified length (e.g. 5). For "I love LSTM" this gives [100, 240, 600, 0, 0] or [0, 0, 100, 240, 600].
- Truncating deletes the extra elements when a sequence is longer than the specified length (e.g. 3). For "I prefer GRU over LSTM" this gives [100, 250, 900] or [900, 760, 600].
- Documentation for the keras.preprocessing.sequence.pad_sequences function: https://keras.io/ja/preprocessing/sequence/

A toy sketch of both operations is shown below.
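As a quick illustration (my own toy sketch, separate from the actual implementation below; the concrete indices depend on the hash and will not match the numbers above):

from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences

vocab_size = 50  # toy vocabulary size
hashed = [one_hot(t, vocab_size) for t in ["I love LSTM", "I prefer GRU over LSTM"]]
print(hashed)  # e.g. [[23, 7, 41], [23, 12, 38, 9, 41]] -- actual indices depend on the hash
padded = pad_sequences(hashed, maxlen=5, padding="post", truncating="post")
print(padded)  # every row now has length 5, padded with trailing zeros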
def encode_with_oneHot(text, total_vocab_freq, max_tweet_length):
    # Apply one-hot encoding, then padding/truncating
    encoded_tweets_oneHot = []
    for each_tweet in text:
        each_encoded_tweet = one_hot(each_tweet, total_vocab_freq)
        encoded_tweets_oneHot.append(each_encoded_tweet)
    each_encoded_tweets_oneHot_pad = pad_sequences(encoded_tweets_oneHot, maxlen=max_tweet_length,
                                                   padding="post", truncating="post")
    return each_encoded_tweets_oneHot_pad
###################################################################################################
### Encode the clean tweets ###
# Count how many times each word appears
vocab_dict = defaultdict(int)
for each_t in df["clean_text"]:
    for w in each_t.split():
        vocab_dict[w] += 1
total_vocab_freq = len(vocab_dict.keys())  # total number of unique words
# Find the length of each tweet
sentence_length_dict = defaultdict(int)
for i, each_t in enumerate(df["clean_text"]):
    sentence_length_dict[i] = len(each_t.split())
max_tweet_length = max(sentence_length_dict.values())  # length of the longest tweet
# Run the encoding
t = time.time()
one_hot_texts = encode_with_oneHot(df["clean_text"], total_vocab_freq, max_tweet_length)
print('One-hot encoding of the tweets is done.')
print(f'Code execution time: {round(time.time()-t)} seconds')
In STEP 2 we hashed (one-hot encoded) the tweets, so we are ready for modeling. The code is shared below.
embedding_length = 32
model = Sequential()
# input_dim is the vocabulary size + 1 (index 0 is reserved for padding, hence mask_zero=True)
model.add(Embedding(input_dim=total_vocab_freq+1, output_dim=embedding_length, input_length=max_tweet_length, mask_zero=True))
model.add(LSTM(units=32))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
As mentioned earlier in "Modeling Strategy", this is the classic LSTM structure, the Vanilla LSTM.
However, I added an Embedding layer in front of the LSTM layer. The reason is that hashed tweets alone carry no semantic meaning.
The Embedding layer takes a hashed word as the key and returns a vector of whatever dimension you choose, and during training these vectors come to carry semantic meaning.
What does that mean in practice? Words can be computed with, as in "king - man + woman = queen". This is important, so let me repeat it: such operations are possible because each word is given a vector representation that encodes semantic meaning.
Now the LSTM can learn the relationships between words.
Regarding Embedding, @9ryuuuuu's article is very easy to understand, so please refer to it: https://qiita.com/9ryuuuuu/items/e4ee171079ffa4b87424
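For intuition, here is a minimal sketch (my own toy example, not part of the actual pipeline) of what the Embedding layer does to hashed word indices:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# toy setup: vocabulary of 50 words, 8-dimensional embeddings, sequences of length 5
toy_model = Sequential()
toy_model.add(Embedding(input_dim=50, output_dim=8, input_length=5, mask_zero=True))

hashed_tweet = np.array([[12, 7, 33, 0, 0]])  # one hashed and padded "tweet"
vectors = toy_model.predict(hashed_tweet)
print(vectors.shape)  # (1, 5, 8): each word index is mapped to an 8-dimensional vector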
Now let's shape the data and start training the model.
# Format the data
y = pd.get_dummies(df["sentiment"]).values
X_train, X_test, y_train, y_test = train_test_split(one_hot_texts, y, test_size=0.2, random_state=123, stratify=y)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
# Start training
batch_size = 256
callback = [EarlyStopping(monitor='val_loss', patience=2, verbose=1)]
hist = model.fit(X_train, y_train, epochs=5, batch_size=batch_size, callbacks=callback, verbose=1, validation_split=0.1)
# Show the accuracy on the validation data
import numpy as np
print("Validation Accuracy:", round(np.mean(hist.history['val_accuracy']), 4))
The validation accuracy was 78%. This is about as accurate as other Kaggle DL implementations.
Approaches that use tf-idf + n-grams as input are a little less accurate; from what I have observed, they land around 68 to 78%.
So my guess is that this model sits at a deviation value of roughly 55 to 60. [Reference: https://www.kaggle.com/kazanova/sentiment140/notebooks]
From here, I only update the preprocessing and run the implementation again, without changing the model architecture.
There were two major improvements.
The first improvement concerns the handling of stop words. Stop words are words such as "not", "no", and "up": a group of words that are customarily removed in natural language processing. However, as in the example below, deleting "no" sometimes changed the meaning of a tweet significantly, so I switched to preprocessing that does not remove stop words.
The second improvement concerns the number of words in a tweet. After preprocessing, some tweets contained only a single word (for example, just "play"), so I deleted such data. A rough sketch of both changes is shown below.
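As a minimal sketch (illustrative, not the exact code I ran; the helper name filter_tokens is just for illustration), the two changes amount to the following:

# Improvement 1: keep stop words such as "no" by no longer filtering them out when joining tokens
def filter_tokens(tweet_tokens, punctuations):
    # previously the condition also dropped words found in ENGLISH_STOPWORDS
    return [word for word in tweet_tokens if word not in punctuations]

# Improvement 2: drop tweets that contain only one word after cleaning
df = df[df["clean_text"].str.split().str.len() > 1].reset_index(drop=True)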
The accuracy improved: 82% on the validation data. This is in line with the top DL implementations (deviation value 60-65?), so I am happy.
Kagglers at the bronze-medal to Grandmaster levels have achieved records of 88-92%, so there is still room for improvement for me. Many of those Grandmasters seem to have implemented CNN + LSTM.
However, for this task my policy is to push preprocessing thoroughly, and I think the accuracy can still reach 90%.
This is because there are still many things that can be improved through preprocessing. For example, the following: removing tweets that are not written in English.
But this is actually quite difficult. You have to judge, for each tweet, whether it is English, and the accuracy of the libraries that make this judgment is so poor that they are not usable.
Below are sentences that the langdetect.detect function judged to be "not English", even though they clearly include English sentences.
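For reference, the kind of check I tried looks roughly like this (a sketch; it assumes the langdetect package is installed and uses the column name from earlier):

from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make langdetect deterministic

def is_english(tweet):
    try:
        return detect(tweet) == 'en'
    except Exception:  # detection fails on empty or very short strings
        return False

df["is_english"] = df["clean_text"].apply(is_english)
print(df.loc[~df["is_english"], "clean_text"].head())  # inspect tweets judged "not English"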
So automating this is difficult, and I have put this preprocessing step on hold for now. Perhaps Kagglers also find it difficult; as far as I can tell, no one has done this preprocessing.
So, I would like to continue this verification and update the article.
Thank you for reading this far.