I made Hee-AI.
This Google Colab is published in a non-editable state. Please copy it to your drive and play with it! (PC environment recommended)
The source code is also available, so feel free to refer to it.
All of this was done on June 26, 2020.
My hobby is exploring Wikipedia, and one day I discovered that the [Trivia Fountain](https://ja.wikipedia.org/wiki/%E3%83%88%E3%83%AA%E3%83%93%E3%82%A2%E3%81%AE%E6%B3%89_%E3%80%9C%E7%B4%A0%E6%99%B4%E3%82%89%E3%81%97%E3%81%8D%E3%83%A0%E3%83%80%E7%9F%A5%E8%AD%98%E3%80%9C) article lists every piece of trivia ever featured on the show along with its hee count.
Without hesitation, I thought, "This is a regression problem...!!!" and immediately fired up my desktop.
I eventually wanted a lot of people to play with it, so at first I considered making it a web service, but I didn't want to interrupt my research for more than a day, so I gave that up. Instead, I decided to run it on Google Colab: if people copy it to their own Drive, they can play with it easily, so I planned along that route.
Google Colab (Google Colaboratory) is a service provided by Google that lets you run Jupyter Notebooks on Google's machines, working from Google Drive. You can also use GPUs. It's very convenient because you don't have to compute locally. However, since it is a Jupyter Notebook, it is not well suited to people like me who love hierarchical directory structures.
But this time I can't complain about that. Moreover, ideally the user should be able to run a single cell and have everything work, keeping their burden as small as possible. On the other hand, if the learning step and the inference step live in the same cell, training re-runs every time you infer, so that part has to be separated. So the finished product I planned was a Google Colab consisting of just the two cells below (a rough sketch follows the list).
- Learning cell
- Inference cell
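To make that concrete, here is a minimal, self-contained sketch of the two-cell pattern. It uses placeholder data and a stand-in LightGBM model rather than the actual Hee-AI code (which appears later in this article); the `#@param` comment is Colab's form syntax for showing a text box.

```python
# --- Cell 1: learning (run once; everything heavy happens here) ---
import numpy as np
import lightgbm as lgb

X = np.random.rand(100, 3)   # placeholder features
y = np.random.rand(100)      # placeholder normalized hee scores
model = lgb.train({'objective': 'regression', 'verbose': -1}, lgb.Dataset(X, y))

# --- Cell 2: inference (run as often as you like, no retraining) ---
# In Colab, the "#@param" comment turns this assignment into a text-box form field.
content = 'Some trivia sentence'  #@param {type:"string"}
# In the real notebook, `content` is featurized before predicting;
# here we use placeholder features to keep the sketch self-contained.
x_new = np.random.rand(1, 3)
print(int(model.predict(x_new)[0] * 100), 'hee')
```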
I also wanted to publish it with as little setup as possible. So rather than cloning a GitHub repository with a directory structure onto Google Colab, I made it self-contained except for the install steps: the Wikipedia data is stored directly in a Python dictionary, and a Python package called stickytape is used to bundle everything into a single Python file, among other tricks.
So it turned into a self-imposed restriction game of building an AI application on Colab with Python alone, and I decided to enter it in the currently running Qiita Summer Festival 2020.
The festival is soliciting articles on three themes. I started writing with theme 1 in mind, but while working on it I realized I could also touch on the machine-learning anti-patterns of theme 3, so I decided to cover that as well.
That was the backstory.
Let's go back to the production story.
First up: scraping. At first I copied the HTML and spent about 15 minutes struggling to format it automatically with vim macros, but there were outliers, and I decided that handling them programmatically was the better option, so I gave up on the macros.
The scraping code I actually wrote looks like this.
scraping.py
```python
import urllib.request

from bs4 import BeautifulSoup

URL = "https://ja.wikipedia.org/wiki/%E3%83%88%E3%83%AA%E3%83%93%E3%82%A2%E3%81%AE%E6%B3%89_%E3%80%9C%E7%B4%A0%E6%99%B4%E3%82%89%E3%81%97%E3%81%8D%E3%83%A0%E3%83%80%E7%9F%A5%E8%AD%98%E3%80%9C"


def get_text(tag):
    # Strip newlines and footnote markers such as [18] from the cell text.
    text = tag.text
    text = text.replace('\n', '')
    text = text.replace('[18]', '')
    text = text.replace('[19]', '')
    text = text.replace('[20]', '')
    text = text.replace('[21]', '')
    text = text.replace('[22]', '')
    return text


if __name__ == "__main__":
    html = urllib.request.urlopen(URL)
    soup = BeautifulSoup(html, 'html.parser')

    trivia_table = soup.find('table', attrs={'class': 'sortable'})

    trivias_list = []
    for i, line in enumerate(trivia_table.tbody):
        # Skip the header rows and blank lines.
        if i < 3:
            continue
        if line == '\n':
            continue

        id = line.find('th')
        content, hee, man_hee = line.find_all('td')
        id, content, hee, man_hee = map(get_text, [id, content, hee, man_hee])

        # Some rows have an unknown hee count ('?'); skip them.
        if hee == '?':
            continue

        trivias_list.append({'id': id, 'content': content, 'hee': int(hee), 'man_hee': int(man_hee)})

    print(trivias_list)
```
Roughly speaking, the flow is:

- Use BeautifulSoup's `soup.find()` to locate the table that holds all the trivia on the Wikipedia Trivia Fountain page.
- Skip unrelated rows with `continue`, reading the table row by row while handling the exceptional cases.
- Even the rows we want contain meaningless strings such as annotation markers, so clean them up with the `get_text` function.
- Store the id, the sentence, the hee count, and so on in `trivias_list`.
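For reference, each element of `trivias_list` ends up as a dictionary shaped like this (the values here are illustrative, not an actual row from the table):

```python
{'id': '1', 'content': 'Some trivia sentence', 'hee': 73, 'man_hee': 100}
```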
Now that we have all the data, it's time for a machine learning shop's specialty: feature engineering. I knew it before I even started, but I couldn't get much accuracy: there are only about 1000 sentences of roughly 20 characters each, so there's only so much you can do. Still, to lose as little as possible, I extracted the following features.
- Sentence length
- Word count
- Hiragana word ratio
- Katakana word ratio
- Kanji word ratio
- English word ratio

(plus a top-5 TF-IDF vector, as you can see in the code below)
The objective variable is `hee count / max hee`. If you used the raw hee count as-is, the scale would differ between broadcasts; at Trivia Fountain specials the maximum was sometimes 200 hee instead of the usual 100. So the objective variable is normalized to the 0-1 range as `hee count / max hee`; for example, 86 hee out of 100 and 172 hee out of 200 both become 0.86.
The code looks like this.
feature.py
```python
import re

import MeCab
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer

# MeCab tokenizer (NEologd dictionary)
tagger = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')
tagger.parse('')

# Character-type predicates
re_hira = re.compile(r'^[ぁ-ん]+$')
re_kata = re.compile(r'[\u30A1-\u30F4]+')
re_kanj = re.compile(r'^[\u4E00-\u9FD0]+$')
re_eigo = re.compile(r'^[a-zA-Z]+$')
is_hira = lambda word: not re_hira.fullmatch(word) is None
is_kata = lambda word: not re_kata.fullmatch(word) is None
is_eigo = lambda word: not re_eigo.fullmatch(word) is None
is_kanj = lambda word: not re_kanj.fullmatch(word) is None


# tl: trivias_list
def normalize_hee(tl):
    # Normalize each hee count to the 0-1 range.
    for i in range(len(tl)):
        tl[i]['norm_hee'] = tl[i]['hee'] / tl[i]['man_hee']
    return tl


def wakati(text):
    # Tokenize with MeCab and join the base forms (fall back to the surface form).
    node = tagger.parseToNode(text)
    l = []
    while node:
        if node.feature.split(',')[6] != '*':
            l.append(node.feature.split(',')[6])
        else:
            l.append(node.surface)
        node = node.next
    return ' '.join(l)


def preprocess(tl):
    tl = normalize_hee(tl)
    for i in tqdm(range(len(tl))):
        tl[i]['wakati_content'] = wakati(tl[i]['content'])
    return tl


def count_len(sentence):
    return len(sentence)


def count_word(sentence):
    return len(sentence.split(' '))


def count_kata(sentence):
    # Ratio of katakana words in the tokenized sentence.
    cnt = 0; total = 0
    for word in sentence.split(' '):
        if word == '': continue
        total += 1
        if is_kata(word): cnt += 1
    return cnt / total


def count_hira(sentence):
    # Ratio of hiragana words in the tokenized sentence.
    cnt = 0; total = 0
    for word in sentence.split(' '):
        if word == '': continue
        total += 1
        if is_hira(word): cnt += 1
    return cnt / total


def count_eigo(sentence):
    # Ratio of English words in the tokenized sentence.
    cnt = 0; total = 0
    for word in sentence.split(' '):
        if word == '': continue
        total += 1
        if is_eigo(word): cnt += 1
    return cnt / total


def count_kanj(sentence):
    # Ratio of kanji words in the tokenized sentence.
    cnt = 0; total = 0
    for word in sentence.split(' '):
        if word == '': continue
        total += 1
        if is_kanj(word): cnt += 1
    return cnt / total


def get_features(trivias_list, content=None, mode='learn'):
    trivias_list = preprocess(trivias_list)
    trivias_df = pd.DataFrame(trivias_list)

    wakati_contents_list = trivias_df['wakati_content'].values.tolist()

    # Fit the TF-IDF vocabulary on ALL scraped trivia.
    word_vectorizer = TfidfVectorizer(max_features=5)
    word_vectorizer.fit(wakati_contents_list)

    if mode == 'inference':
        # Featurize a single new trivia sentence using the same vocabulary.
        content = [{'content': content, 'wakati_content': wakati(content)}]
        content_df = pd.DataFrame(content)
        wakati_content_list = content_df['wakati_content'].values.tolist()
        tfidf = word_vectorizer.transform(wakati_content_list)
        content_df = pd.concat([
            content_df,
            pd.DataFrame(tfidf.toarray())
        ], axis=1)
        num_len_df = content_df['wakati_content'].map(count_len)
        num_word_df = content_df['wakati_content'].map(count_word)
        num_hira_df = content_df['wakati_content'].map(count_hira)
        num_kata_df = content_df['wakati_content'].map(count_kata)
        num_eigo_df = content_df['wakati_content'].map(count_eigo)
        num_kanj_df = content_df['wakati_content'].map(count_kanj)
        content_df['num_len'] = num_len_df.values.tolist()
        content_df['num_word'] = num_word_df.values.tolist()
        content_df['num_hira'] = num_hira_df.values.tolist()
        content_df['num_kata'] = num_kata_df.values.tolist()
        content_df['num_eigo'] = num_eigo_df.values.tolist()
        content_df['num_kanj'] = num_kanj_df.values.tolist()
        content_df = content_df.drop('content', axis=1)
        content_df = content_df.drop('wakati_content', axis=1)
        return content_df

    tfidf = word_vectorizer.transform(wakati_contents_list)
    all_df = pd.concat([
        trivias_df,
        pd.DataFrame(tfidf.toarray())
    ], axis=1)
    num_len_df = all_df['wakati_content'].map(count_len)
    num_word_df = all_df['wakati_content'].map(count_word)
    num_hira_df = all_df['wakati_content'].map(count_hira)
    num_kata_df = all_df['wakati_content'].map(count_kata)
    num_eigo_df = all_df['wakati_content'].map(count_eigo)
    num_kanj_df = all_df['wakati_content'].map(count_kanj)
    all_df['num_len'] = num_len_df.values.tolist()
    all_df['num_word'] = num_word_df.values.tolist()
    all_df['num_hira'] = num_hira_df.values.tolist()
    all_df['num_kata'] = num_kata_df.values.tolist()
    all_df['num_eigo'] = num_eigo_df.values.tolist()
    all_df['num_kanj'] = num_kanj_df.values.tolist()

    if mode == 'learn':
        # Drop everything that must not be fed to the model.
        all_df = all_df.drop('id', axis=1)
        all_df = all_df.drop('hee', axis=1)
        all_df = all_df.drop('man_hee', axis=1)
        all_df = all_df.drop('content', axis=1)
        all_df = all_df.drop('wakati_content', axis=1)

    return all_df
```
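For reference, `get_features` gets called in two ways: learning mode builds the full feature table (presumably inside `load_data`), and inference mode featurizes a single new sentence against the same TF-IDF vocabulary, as in the training script below. A minimal usage sketch, assuming `trivias_list` is the scraped list from earlier:

```python
# Learning mode: build the feature table for all scraped trivia.
all_df = get_features(trivias_list, mode='learn')

# Inference mode: featurize one new trivia sentence.
content_df = get_features(trivias_list, content='Some trivia sentence', mode='inference')
```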
I chose LightGBM as the model, honestly because anything else felt like a hassle. The thought process was roughly: "What else could I use? A neural net? Then I'd have to normalize the features... hmm, gradient boosting it is!" lol
That said, I did do the bare minimum properly:
- Evaluation metric: MSE
- Compute the MSE on 5 held-out folds with KFold
- Hyperparameter search with Optuna, using the mean of the 5 fold MSEs as the objective value
  - max_depth: 1-20
  - learning_rate: 0.001-0.1
  - num_leaves: 2-70
The final best MSE was around 0.014.
In other words, the average squared error per trivia is about 0.014; taking the square root gives about 0.118, and since the target was normalized, multiplying by 100 brings it back to the hee scale.
So the error is roughly 11.8 hee. (Hmm...)
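Spelled out as code, that back-of-the-envelope conversion is just:

```python
import math

mse = 0.014            # best mean squared error on the 0-1 normalized target
rmse = math.sqrt(mse)  # ≈ 0.118
print(rmse * 100)      # ≈ 11.8 hee on the original 0-100 scale
```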
The code looks like this.
train_lgb.py
```python
import optuna
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import KFold

from data.loader import load_data
from data.feature import get_features
from data.trivias_list import trivias_list


def objective(trial):
    max_depth = trial.suggest_int('max_depth', 1, 20)
    learning_rate = trial.suggest_uniform('learning_rate', 0.001, 0.1)
    params = {
        'metric': 'l2',
        'num_leaves': trial.suggest_int("num_leaves", 2, 70),
        'max_depth': max_depth,
        'learning_rate': learning_rate,
        'objective': 'regression',
        'verbose': 0
    }

    mse_list = []
    kfold = KFold(n_splits=5, shuffle=True, random_state=1)
    for train_idx, valid_idx in kfold.split(X, y):
        X_train = X.iloc[train_idx]
        y_train = y.iloc[train_idx]
        X_valid = X.iloc[valid_idx]
        y_valid = y.iloc[valid_idx]

        lgb_train = lgb.Dataset(X_train, y_train)
        lgb_valid = lgb.Dataset(X_valid, y_valid)

        model = lgb.train(params,
                          lgb_train,
                          valid_sets=lgb_valid,
                          verbose_eval=10,
                          early_stopping_rounds=30)

        # MSE on this fold
        pred_y_valid = model.predict(X_valid, num_iteration=model.best_iteration)
        true_y_valid = np.array(y_valid.tolist())
        mse = np.sum((pred_y_valid - true_y_valid)**2) / len(true_y_valid)
        mse_list.append(mse)

    # Optuna minimizes the mean MSE over the 5 folds.
    return np.mean(mse_list)


def build_model():
    study = optuna.create_study()
    study.optimize(objective, n_trials=500)

    # Retrain on all data with the best parameters,
    # keeping the last 20% as a validation set for early stopping.
    valid_split = 0.2
    num_train = int((1 - valid_split) * len(X))
    X_train = X[:num_train]   # (not used below; training uses the full dataset)
    y_train = y[:num_train]
    X_valid = X[num_train:]
    y_valid = y[num_train:]

    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_valid = lgb.Dataset(X_valid, y_valid)
    lgb_data = lgb.Dataset(X, y)

    params = study.best_params
    params['metric'] = 'l2'
    model = lgb.train(params,
                      lgb_data,
                      valid_sets=lgb_valid,
                      verbose_eval=10,
                      early_stopping_rounds=30)

    return model


X, y = load_data(trivias_list)

if __name__ == "__main__":
    model = build_model()

    content = 'The amount of honey a bee collects in its lifetime is about one teaspoon.'
    content_df = get_features(trivias_list, content=content, mode='inference')
    output = model.predict(content_df)
    hee = int(output[0] * 100)

    print(f"{content}")
    print(f"{hee} hee")
```
Now let me describe the anti-patterns I fell into while building the machine learning model. I hope they will be useful to you, and to my future self.
For the scaling of the objective variable, I used `hee count / max hee`. Fortunately I caught it right away, but at first I was fully prepared to use the raw hee count as the target, and I kept coding that way. Had I continued, I would probably have gotten no accuracy and given up with a "well, this problem is just too hard..."
Sorry, this one is a very minor point... but it is my second failure, so let me write it down.
To embed sentences with TF-IDF, you need to take the following three steps.
```python
# 1. Create the vectorizer instance.
word_vectorizer = TfidfVectorizer(max_features=max_features)
# 2. Fit it on the word-separated (wakati) list of ALL sentences.
word_vectorizer.fit(wakati_contents_list)
# 3. Transform the word-separated sentences you want to embed.
tfidf = word_vectorizer.transform(wakati_contents_list)
```
Because TF-IDF has to measure how rare each word is across the whole corpus, you must pass all of the sentences to the vectorizer instance once; you can't just hand it a single sentence and say "embed this".
However, I hadn't internalized that TF-IDF is a deterministic transformation that still needs to be fitted, and I kept trying to casually call transform without fitting...
Everyone should be careful.
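For what it's worth, scikit-learn makes this mistake hard to miss: calling `transform` on a vectorizer that hasn't been fitted raises an error. A small sketch with placeholder sentences:

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['some wakati sentence', 'another wakati sentence']  # placeholder corpus

vectorizer = TfidfVectorizer(max_features=5)
try:
    vectorizer.transform(docs)  # transform before fit
except NotFittedError as e:
    print('not fitted yet:', e)

vectorizer.fit(docs)                     # fit on the whole corpus first
print(vectorizer.transform(docs).shape)  # now it works
```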
I wondered whether to write this one because it is so basic, but the fact is it tripped me up for a minute, so here it is.
Yes, this part:
```python
all_df = all_df.drop('id', axis=1)
all_df = all_df.drop('hee', axis=1)
all_df = all_df.drop('man_hee', axis=1)
all_df = all_df.drop('content', axis=1)
all_df = all_df.drop('wakati_content', axis=1)
```
Columns that should not be fed into LightGBM, such as `id` and `content` (the original string), need to be properly dropped from the pandas DataFrame.
I think this mistake is surprisingly easy to make.
During feature generation you keep cheerfully appending columns to the DataFrame as shown below, so it's easy to forget that some of them aren't needed. (Maybe it's because I'm new to pandas; I've been too deep into deep learning.)
```python
all_df['num_eigo'] = num_eigo_df.values.tolist()
```
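Incidentally, pandas can drop all of those columns in a single call, which makes it a little harder to forget one:

```python
# Equivalent to the five separate drop() calls above.
all_df = all_df.drop(columns=['id', 'hee', 'man_hee', 'content', 'wakati_content'])
```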
Now, as mentioned at the beginning, the finished product is published on Google Colab.
I bundled it into a single Python file with stickytape and set up the UI so that it's easy to play with even on Google Colab. I wrote an article about stickytape just yesterday, so please have a look:
Make multiple python files into one python file @wataoka
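For reference, bundling with stickytape is essentially a one-line command. Here is a hypothetical invocation via Python's subprocess; the entry-point and output file names are placeholders, and the `--add-python-path` flag is as I recall from the stickytape README, so double-check against its docs.

```python
# Hypothetical sketch: bundle the project into one standalone Python file.
import subprocess

bundled = subprocess.run(
    ['stickytape', 'main.py', '--add-python-path', '.'],  # placeholder paths
    capture_output=True, text=True, check=True,
).stdout

with open('hee_ai_standalone.py', 'w') as f:
    f.write(bundled)
```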
Now that it's done, let me play with it myself. After training, I fed Hee-AI some trivia I happened to know and let it infer.
-"Toyotomi Hideyoshi is correct to read as Toyotomi Hideyoshi" ** 72 Hey **
Tough ...
-"Mr. Noppo has spoken." ** 81 Hey ** At the head family, it's 99 and it's the top in history ...
--"Sunny days are bright" ** 68 Hey ** Obviously, is it a little lower?
Overall, it's a tough evaluation. I never reached 90. Is 5 out of 5 Tamori-san?
Please let me know when anyone reaches 90.
Writing this at the beginning would only have gotten in the way, so let me quietly introduce myself at the end.
name | Aki Wataoka |
---|---|
school | Kobe University Graduate School |
undergraduate research | machine learning, speech processing |
graduate research | machine learning, fairness, generative models, etc. |
Twitter | @Wataoka_Koki |
Follow me on Twitter!