I've dabbled in Python, but I have little experience implementing machine learning. I especially wanted to build something using natural language processing, and with that in mind I built a buzz estimation model for article titles.
Trend article titles are collected with the Twitter API from the Twitter account (Qiita Popular Posts) that introduces trending articles. 3,229 records were collected; the URLs and emoji in each tweet are removed, and the result is dumped to a JSON file.
def retrieveTweets(screenName, count):
    global totalIdx
    timeLine = t.statuses.user_timeline(screen_name=screenName, count=count)
    maxId = 0
    for tweetsIdx, tweet in enumerate(timeLine):
        maxId = tweet["id"]
        addArticleTitles(tweet)
        totalIdx += 1
    print("Starting additional retrieving...")
    retrieveContinuedTweets(screenName, count, maxId)

def retrieveContinuedTweets(screenName, count, maxId):
    global totalIdx, isFinished
    tmpMaxId = maxId
    while True:
        timeLine = t.statuses.user_timeline(screen_name=screenName, count=count, max_id=tmpMaxId)
        prevMaxId = 0
        for tweetsIdx, tweet in enumerate(timeLine):
            tmpMaxId = tweet["id"]
            addArticleTitles(tweet)
            print("totalIdx = {}, prevMaxId = {}, maxId = {}, title = {}\n".format(totalIdx, prevMaxId, tmpMaxId, trendArticleTitles[totalIdx]["articleTitle"]))
            if prevMaxId == 0 and totalIdx % 200 != 0:
                isFinished = True
                break
            prevMaxId = tmpMaxId
            totalIdx += 1
        if isFinished:
            print("Finished collecting {} qiita_trend_titles.".format(totalIdx))
            break

def addArticleTitles(tweet):
    global trendArticleTitles
    tmpTitle = re.sub(r"(https?|ftp)(:\/\/[-_\.!~*\'()a-zA-Z0-9;\/?:\@&=\+\$,%#]+)", "", tweet["text"])  # Remove URLs in tweets
    tmpTitle = ''.join(s for s in tmpTitle if s not in emoji.UNICODE_EMOJI)
    articleTitle = tmpTitle[:len(tmpTitle)-1]  # Remove the trailing half-width space
    datum = {"articleTitle": articleTitle}
    trendArticleTitles.append(datum)
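For completeness, here is a rough sketch of how the collection above might be driven and dumped. The attribute-style call t.statuses.user_timeline matches the python "twitter" package (Python Twitter Tools), so the client setup below assumes that library; the credentials, screen name, and file path are placeholders/assumptions.

import json
import re

import emoji
from twitter import Twitter, OAuth

# Globals used by the functions above
trendArticleTitles = []
totalIdx = 0
isFinished = False

# Assumed client setup (Python Twitter Tools); replace the placeholders with real credentials
t = Twitter(auth=OAuth("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET", "CONSUMER_KEY", "CONSUMER_SECRET"))

retrieveTweets("<QiitaPopularPosts_screen_name>", 200)  # placeholder for the "Qiita Popular Posts" account

# Dump the collected trend titles; the file path is an assumption
with open("./datasets/qiita_trend_titles.json", mode="w", encoding="utf-8") as f:
    json.dump(trendArticleTitles, f, ensure_ascii=False, indent=2)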
Regular article titles that did not buzz are fetched with the Qiita API. Here, 9,450 records were collected and dumped to a JSON file in the same way as the trend article titles.
articleTitles = []
idx = 0
print("Starting collecting article titles...")
for page in range(3, 101):
    # Exclude early pages to exclude articles from spam accounts
    params = {"page": str(page), "per_page": str(per_page)}
    response = requests.get(url, headers=headers, params=params)
    resJson = response.json()
    for article in resJson:
        if article.get("likes_count") < notBuzzThreshold:
            title = article.get("title")
            articleTitles.append({"articleTitle": title})
            print("{}th article title = {}, url = {}".format(idx, title, article["url"]))
            idx += 1
print("Finished collecting {} qiita_article_titles.".format(idx))
First, load the two types of article title data collected above and combine them into a single dataset, adding a flag that marks whether each title belongs to a trend article. Just in case, the contents of the combined data are shuffled. Here too, a JSON file is dumped at the end, which finishes the data collection.
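The merge code below uses trendData and normalData. A minimal sketch of loading them, assuming the two JSON files dumped earlier (the file paths are assumptions):

import json
import random

# Load the two dumps produced earlier (paths are assumptions)
with open("./datasets/qiita_trend_titles.json", mode="r", encoding="utf-8") as f:
    trendData = json.load(f)   # titles of trend (buzzed) articles
with open("./datasets/qiita_article_titles.json", mode="r", encoding="utf-8") as f:
    normalData = json.load(f)  # titles of regular (not buzzed) articles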
mergedData = []
for datum in trendData:
    mergedData.append({
        "articleTitle": datum["articleTitle"],
        "isTrend": 1
    })
for datum in normalData:
    mergedData.append({
        "articleTitle": datum["articleTitle"],
        "isTrend": 0
    })

# Shuffle the order of the combined results
random.shuffle(mergedData)
print("Finished shuffling 'Merged Article Titles'.")
I decided to build the classifier with Naive Bayes, but I wasn't sure where to start. So before this implementation, I reviewed Naive Bayes itself and worked through an article that implements spam detection with it, to get a feel for the approach.
"Chap2_SpamDetection.md" explains Naive Bayes with a concrete example and helped me confirm the basics, such as "what was Naive Bayes in the first place?". "[WIP] Introduction because the naive Bayes classifier is not simple at all" color-codes the parts of the formula, which let me confirm what kind of calculation is actually performed.
Having learned that much about Naive Bayes, I moved on to a practice implementation, following "Machine learning - Junk mail classification (naive bayes classifier)".
Now that I had a feel for Naive Bayes, it was time to get to the main subject. Below I describe the parts I modified from the implementation in the practice article.
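For reference, the practice implementation essentially vectorizes the texts with a bag-of-words model and fits a multinomial Naive Bayes classifier. A minimal sketch of that baseline applied to the merged title data, assuming scikit-learn; the split ratio and variable names are my own:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# texts: article titles, labels: 1 = trend (buzzed), 0 = not
texts = [datum["articleTitle"] for datum in mergedData]
labels = [datum["isTrend"] for datum in mergedData]

vecCount = CountVectorizer(min_df=3)  # bag-of-words counts with the default analyzer
X = vecCount.fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=0)

model = MultinomialNB().fit(X_train, y_train)
print("Train accuracy = {:.3f}".format(model.score(X_train, y_train)))
print("Test accuracy = {:.3f}".format(model.score(X_test, y_test)))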
The spam detection dataset is in English, so it can be fed to scikit-learn as-is, but that doesn't work for Qiita article titles. First, I added MeCab and mecab-ipadic-NEologd so that Japanese text can be segmented into words properly. (Leaving the segmentation to CountVectorizer alone produced unnatural splits.)
I mainly referred to the site below.
Compared with the spam detection practice implementation, I added the following:
import MeCab
import emoji
import neologdn

def getStopWords():
    stopWords = []
    with open("./datasets/Japanese.txt", mode="r", encoding="utf-8") as f:
        for word in f:
            if word != "\n":
                stopWords.append(word.rstrip("\n"))
    print("amount of stopWords = {}".format(len(stopWords)))
    return stopWords

def removeEmoji(text):
    return "".join(ch for ch in text if ch not in emoji.UNICODE_EMOJI)

stopWords = getStopWords()
tagger = MeCab.Tagger("mecabrc")

def extractWords(text):
    text = removeEmoji(text)
    text = neologdn.normalize(text)
    words = []
    analyzedResults = tagger.parse(text).split("\n")
    for result in analyzedResults:
        splittedWord = result.split(",")[0].split("\t")[0]  # surface form of each token
        if splittedWord not in stopWords:
            words.append(splittedWord)
    return words
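One note: MeCab.Tagger("mecabrc") uses whatever dictionary the mecabrc configuration points to. If that is not already mecab-ipadic-NEologd, the tagger can be pointed at it explicitly; the dictionary path below is an assumption and depends on how NEologd was installed (it can be checked with mecab-config --dicdir).

import MeCab

# Assumed NEologd dictionary location; adjust to your environment
NEOLOGD_DIC = "/usr/lib/mecab/dic/mecab-ipadic-neologd"
tagger = MeCab.Tagger("-d {}".format(NEOLOGD_DIC))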
Passing the word-splitting function to CountVectorizer's analyzer argument lets it segment Japanese properly. Great.
vecCount = CountVectorizer(analyzer=extractWords, min_df=3)
We prepared three texts for prediction: "I released the app", "Unity tutorial", and "Git command memo". The assumption is that "I released the app" is the one likely to buzz.
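Training with the Japanese-aware vectorizer and classifying the three titles might look roughly like this, reusing train_test_split and MultinomialNB from the earlier sketch (the split and variable names remain assumptions):

# Vectorize with the analyzer-based CountVectorizer, then train and evaluate
X = vecCount.fit_transform(texts)
print("word size: {}".format(len(vecCount.vocabulary_)))

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=0)
model = MultinomialNB().fit(X_train, y_train)
print("Train accuracy = {:.3f}".format(model.score(X_train, y_train)))
print("Test accuracy = {:.3f}".format(model.score(X_test, y_test)))

# Classify the three example titles (1 = predicted to trend)
exampleTexts = ["I released the app", "Unity tutorial", "Git command memo"]
print(model.predict(vecCount.transform(exampleTexts)))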
Clearly the vocabulary is small, and the titles do not seem to have been segmented properly.
word size: 1016
word content: {'From': 809, 'ms': 447, 'nginx': 464, 'django': 232, 'intellij': 363}
Train accuracy = 0.771
Test accuracy = 0.747
[0 0 0]
The words now seem to be segmented properly, although there are a lot of them... The classification came out as expected.
word size: 3870
word content: {'From': 1696, 'MS': 623, 'Teams': 931, 'To': 1853, 'notification': 3711}
Train accuracy = 0.842
Test accuracy = 0.783
[1 0 0]
The number of words has been reduced, and the accuracy on the test data increased slightly. This made me appreciate the importance of preprocessing.
word size: 3719
word content: {'MS': 623, 'Teams': 931, 'To': 1824, 'notification': 3571, 'To do': 1735}
Train accuracy = 0.842
Test accuracy = 0.784
[1 0 0]
The accuracy on the training data decreased slightly, while the accuracy on the test data increased by about the same amount. Also, I had forgotten to display the classification probabilities, so I show them here (a possible way to obtain them is sketched below). I was honestly surprised that the text I assumed would buzz got a higher probability than I expected. (It won't be reliable until it is tried on many more texts...)
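A minimal sketch of printing those probabilities with scikit-learn, reusing the fitted objects from above:

# Per-class probabilities for the three example titles:
# column 0 = P(not trend), column 1 = P(trend)
print(model.predict(vecCount.transform(exampleTexts)))
print(model.predict_proba(vecCount.transform(exampleTexts)))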
word size: 3700
word content: {'MS': 648, 'Teams': 955, 'To': 1838, 'notification': 3583, 'To do': 1748}
[1 0 0]
[[0.23452364 0.76547636]
[0.92761086 0.07238914]
[0.99557625 0.00442375]]
Train accuracy = 0.841
Test accuracy = 0.785
This time the change in accuracy seemed to be within the margin of error. Since only the terms covered by NEologd are guaranteed to be recognized, I think accuracy could be improved by covering technical terms better in the vectorization. Beyond that, accuracy should also improve by extracting important words from the article title and body with TF-IDF or similar and making use of them; a possible starting point is sketched below.
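As one possible next step, the CountVectorizer could be swapped for a TF-IDF weighted vectorizer while keeping the same Japanese analyzer. A minimal sketch, assuming scikit-learn's TfidfVectorizer; whether this actually helps would need to be verified:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# TF-IDF weighting instead of raw counts, with the same Japanese analyzer
vecTfidf = TfidfVectorizer(analyzer=extractWords, min_df=3)
X_tfidf = vecTfidf.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(X_tfidf, labels, test_size=0.3, random_state=0)
modelTfidf = MultinomialNB().fit(X_train, y_train)
print("Test accuracy = {:.3f}".format(modelTfidf.score(X_test, y_test)))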