I've dabbled in Python, but I have little experience implementing machine learning. I especially wanted to build something using natural language processing, and with that in mind I built a buzz estimation model for article titles.
Trend article titles are collected with the Twitter API from the Twitter account (Qiita Popular Posts) that introduces trending articles. 3,229 records were collected; the URLs and emoji in each tweet are removed, and the result is dumped to a JSON file.
def retrieveTweets(screenName, count):
    global totalIdx
    timeLine = t.statuses.user_timeline(screen_name=screenName, count=count)
    maxId = 0
    for tweetsIdx, tweet in enumerate(timeLine):
        maxId = tweet["id"]
        addArticleTitles(tweet)
        totalIdx += 1
    print("Starting additional retrieving...")
    retrieveContinuedTweets(screenName, count, maxId)

def retrieveContinuedTweets(screenName, count, maxId):
    global totalIdx, isFinished
    tmpMaxId = maxId
    while True:
        timeLine = t.statuses.user_timeline(screen_name=screenName, count=count, max_id=tmpMaxId)
        prevMaxId = 0
        for tweetsIdx, tweet in enumerate(timeLine):
            tmpMaxId = tweet["id"]
            addArticleTitles(tweet)
            print("totalIdx = {}, prevMaxId = {}, maxId = {}, title = {}\n".format(totalIdx, prevMaxId, tmpMaxId, trendArticleTitles[totalIdx]["articleTitle"]))
            if prevMaxId == 0 and totalIdx % 200 != 0:
                isFinished = True
                break
            prevMaxId = tmpMaxId
            totalIdx += 1
        if isFinished:
            print("Finished collecting {} qiita_trend_titles.".format(totalIdx))
            break

def addArticleTitles(tweet):
    global trendArticleTitles
    tmpTitle = re.sub(r"(https?|ftp)(:\/\/[-_\.!~*\'()a-zA-Z0-9;\/?:\@&=\+\$,%#]+)", "", tweet["text"])  # Remove URLs in tweets
    tmpTitle = ''.join(s for s in tmpTitle if s not in emoji.UNICODE_EMOJI)
    articleTitle = tmpTitle[:len(tmpTitle)-1]  # Remove the trailing half-width space
    datum = {"articleTitle": articleTitle}
    trendArticleTitles.append(datum)
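For completeness, here is a rough sketch of how the collection above might be driven and dumped. The attribute-style call t.statuses.user_timeline matches the python "twitter" package (Python Twitter Tools), so the client setup below assumes that library; the credentials, screen name, and file path are placeholders/assumptions.

import json
import re

import emoji
from twitter import Twitter, OAuth

# Globals used by the functions above
trendArticleTitles = []
totalIdx = 0
isFinished = False

# Assumed client setup (Python Twitter Tools); replace the placeholders with real credentials
t = Twitter(auth=OAuth("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET", "CONSUMER_KEY", "CONSUMER_SECRET"))

retrieveTweets("<QiitaPopularPosts_screen_name>", 200)  # placeholder for the "Qiita Popular Posts" account

# Dump the collected trend titles; the file path is an assumption
with open("./datasets/qiita_trend_titles.json", mode="w", encoding="utf-8") as f:
    json.dump(trendArticleTitles, f, ensure_ascii=False, indent=2)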
Regular article titles that did not buzz are fetched with the Qiita API. Here, 9,450 records were collected and dumped to a JSON file in the same way as the trend article titles.
articleTitles = []
idx = 0
print("Starting collecting article titles...")
for page in range(3, 101):
    # Exclude early pages to exclude articles from spam accounts
    params = {"page": str(page), "per_page": str(per_page)}
    response = requests.get(url, headers=headers, params=params)
    resJson = response.json()
    for article in resJson:
        if article.get("likes_count") < notBuzzThreshold:
            title = article.get("title")
            articleTitles.append({"articleTitle": title})
            print("{}th article title = {}, url = {}".format(idx, title, article["url"]))
            idx += 1
print("Finished collecting {} qiita_article_titles.".format(idx))
First, load the two types of article title data collected above and combine them into a single dataset, adding a flag that marks whether each title belongs to a trend article. Just in case, the contents of the combined data are shuffled. Here too, a JSON file is dumped at the end, which finishes the data collection.
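The merge code below uses trendData and normalData. A minimal sketch of loading them, assuming the two JSON files dumped earlier (the file paths are assumptions):

import json
import random

# Load the two dumps produced earlier (paths are assumptions)
with open("./datasets/qiita_trend_titles.json", mode="r", encoding="utf-8") as f:
    trendData = json.load(f)   # titles of trend (buzzed) articles
with open("./datasets/qiita_article_titles.json", mode="r", encoding="utf-8") as f:
    normalData = json.load(f)  # titles of regular (not buzzed) articles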
mergedData = []
for datum in trendData:
    mergedData.append({
        "articleTitle": datum["articleTitle"],
        "isTrend": 1
    })
for datum in normalData:
    mergedData.append({
        "articleTitle": datum["articleTitle"],
        "isTrend": 0
    })

# Shuffle the order of the combined results
random.shuffle(mergedData)
print("Finished shuffling 'Merged Article Titles'.")
I decided to build the classifier with Naive Bayes, but I wasn't sure where to start. So before this implementation, I reviewed Naive Bayes itself and worked through an article that implements spam detection with it, to get a feel for the approach.
"Chap2_SpamDetection.md" explains Naive Bayes with a concrete example and helped me confirm the basics, such as "what was Naive Bayes in the first place?". "[WIP] Introduction because the naive Bayes classifier is not simple at all" color-codes the parts of the formula, which let me confirm what kind of calculation is actually performed.
Having learned that much about Naive Bayes, I moved on to a practice implementation, following "Machine learning - Junk mail classification (naive bayes classifier)".
Now that I had a feel for Naive Bayes, it was time to get to the main subject. Below I describe the parts I modified from the implementation in the practice article.
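For reference, the practice implementation essentially vectorizes the texts with a bag-of-words model and fits a multinomial Naive Bayes classifier. A minimal sketch of that baseline applied to the merged title data, assuming scikit-learn; the split ratio and variable names are my own:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# texts: article titles, labels: 1 = trend (buzzed), 0 = not
texts = [datum["articleTitle"] for datum in mergedData]
labels = [datum["isTrend"] for datum in mergedData]

vecCount = CountVectorizer(min_df=3)  # bag-of-words counts with the default analyzer
X = vecCount.fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=0)

model = MultinomialNB().fit(X_train, y_train)
print("Train accuracy = {:.3f}".format(model.score(X_train, y_train)))
print("Test accuracy = {:.3f}".format(model.score(X_test, y_test)))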
The spam detection dataset is in English, so it can be fed to scikit-learn as-is, but that doesn't work for Qiita article titles. First, I added MeCab and mecab-ipadic-NEologd so that Japanese text can be segmented into words properly. (Leaving the segmentation to CountVectorizer alone produced unnatural splits.)
I mainly referred to the site below.
Compared with the spam detection practice implementation, I added the following:
import MeCab
import emoji
import neologdn

def getStopWords():
    stopWords = []
    with open("./datasets/Japanese.txt", mode="r", encoding="utf-8") as f:
        for word in f:
            if word != "\n":
                stopWords.append(word.rstrip("\n"))
    print("amount of stopWords = {}".format(len(stopWords)))
    return stopWords

def removeEmoji(text):
    return "".join(ch for ch in text if ch not in emoji.UNICODE_EMOJI)

stopWords = getStopWords()
tagger = MeCab.Tagger("mecabrc")

def extractWords(text):
    text = removeEmoji(text)
    text = neologdn.normalize(text)
    words = []
    analyzedResults = tagger.parse(text).split("\n")
    for result in analyzedResults:
        splittedWord = result.split(",")[0].split("\t")[0]  # surface form of each token
        if splittedWord not in stopWords:
            words.append(splittedWord)
    return words
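One note: MeCab.Tagger("mecabrc") uses whatever dictionary the mecabrc configuration points to. If that is not already mecab-ipadic-NEologd, the tagger can be pointed at it explicitly; the dictionary path below is an assumption and depends on how NEologd was installed (it can be checked with mecab-config --dicdir).

import MeCab

# Assumed NEologd dictionary location; adjust to your environment
NEOLOGD_DIC = "/usr/lib/mecab/dic/mecab-ipadic-neologd"
tagger = MeCab.Tagger("-d {}".format(NEOLOGD_DIC))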
Passing the word-splitting function to CountVectorizer's analyzer argument lets it segment Japanese properly. Great.
vecCount = CountVectorizer(analyzer=extractWords, min_df=3)
We prepared three texts for prediction: "I released the app", "Unity tutorial", and "Git command memo". The assumption is that "I released the app" is the one likely to buzz.
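Training with the Japanese-aware vectorizer and classifying the three titles might look roughly like this, reusing train_test_split and MultinomialNB from the earlier sketch (the split and variable names remain assumptions):

# Vectorize with the analyzer-based CountVectorizer, then train and evaluate
X = vecCount.fit_transform(texts)
print("word size: {}".format(len(vecCount.vocabulary_)))

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=0)
model = MultinomialNB().fit(X_train, y_train)
print("Train accuracy = {:.3f}".format(model.score(X_train, y_train)))
print("Test accuracy = {:.3f}".format(model.score(X_test, y_test)))

# Classify the three example titles (1 = predicted to trend)
exampleTexts = ["I released the app", "Unity tutorial", "Git command memo"]
print(model.predict(vecCount.transform(exampleTexts)))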
Clearly the vocabulary is small, and the titles do not seem to have been segmented properly.
word size: 1016
word content: {'From': 809, 'ms': 447, 'nginx': 464, 'django': 232, 'intellij': 363}
Train accuracy = 0.771
Test accuracy = 0.747
[0 0 0]
The words now seem to be segmented properly, although there are a lot of them... The classification came out as expected.
word size: 3870
word content: {'From': 1696, 'MS': 623, 'Teams': 931, 'To': 1853, 'notification': 3711}
Train accuracy = 0.842
Test accuracy = 0.783
[1 0 0]
The number of words has been reduced, and the accuracy on the test data increased slightly. This made me appreciate the importance of preprocessing.
word size: 3719
word content: {'MS': 623, 'Teams': 931, 'To': 1824, 'notification': 3571, 'To do': 1735}
Train accuracy = 0.842
Test accuracy = 0.784
[1 0 0]
The accuracy on the training data decreased slightly, while the accuracy on the test data increased by about the same amount. Also, I had forgotten to display the classification probabilities, so I show them here (a possible way to obtain them is sketched below). I was honestly surprised that the text I assumed would buzz got a higher probability than I expected. (It won't be reliable until it is tried on many more texts...)
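A minimal sketch of printing those probabilities with scikit-learn, reusing the fitted objects from above:

# Per-class probabilities for the three example titles:
# column 0 = P(not trend), column 1 = P(trend)
print(model.predict(vecCount.transform(exampleTexts)))
print(model.predict_proba(vecCount.transform(exampleTexts)))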
word size: 3700
word content: {'MS': 648, 'Teams': 955, 'To': 1838, 'notification': 3583, 'To do': 1748}
[1 0 0]
[[0.23452364 0.76547636]
[0.92761086 0.07238914]
[0.99557625 0.00442375]]
Train accuracy = 0.841
Test accuracy = 0.785
This time the change in accuracy seemed to be within the margin of error. Since only the terms covered by NEologd are guaranteed to be recognized, I think accuracy could be improved by covering technical terms better in the vectorization. Beyond that, accuracy should also improve by extracting important words from the article title and body with TF-IDF or similar and making use of them; a possible starting point is sketched below.
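As one possible next step, the CountVectorizer could be swapped for a TF-IDF weighted vectorizer while keeping the same Japanese analyzer. A minimal sketch, assuming scikit-learn's TfidfVectorizer; whether this actually helps would need to be verified:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# TF-IDF weighting instead of raw counts, with the same Japanese analyzer
vecTfidf = TfidfVectorizer(analyzer=extractWords, min_df=3)
X_tfidf = vecTfidf.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(X_tfidf, labels, test_size=0.3, random_state=0)
modelTfidf = MultinomialNB().fit(X_train, y_train)
print("Test accuracy = {:.3f}".format(modelTfidf.score(X_test, y_test)))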