I made Hee-AI.
This Google Colab is published in a non-editable state. Please copy it to your drive and play with it! (PC environment recommended)
The source code is also available, so feel free to refer to it.
All of this was done on June 26, 2020.
My hobby is exploring Wikipedia, and one day I discovered that the [Trivia Fountain](https://ja.wikipedia.org/wiki/%E3%83%88%E3%83%AA%E3%83%93%E3%82%A2%E3%81%AE%E6%B3%89_%E3%80%9C%E7%B4%A0%E6%99%B4%E3%82%89%E3%81%97%E3%81%8D%E3%83%A0%E3%83%80%E7%9F%A5%E8%AD%98%E3%80%9C) article lists every piece of trivia ever featured on the show along with its hee count.
Without hesitation, I thought, "This is a regression problem...!!!" and immediately fired up my desktop.
I eventually wanted a lot of people to play with it, so at first I considered making it a web service, but I didn't want to interrupt my research for more than a day, so I gave that up. Instead, I decided to run it on Google Colab: if people copy it to their own Drive, they can play with it easily, so I planned along that route.
Google Colab (Google Colaboratory) is a service provided by Google that lets you run Jupyter Notebooks on Google's machines, working from Google Drive. You can also use GPUs. It's very convenient because you don't have to compute locally. However, since it is a Jupyter Notebook, it is not well suited to people like me who love hierarchical directory structures.
But this time I can't complain about that. Moreover, ideally the user should be able to run a single cell and have everything work, keeping their burden as small as possible. On the other hand, if the learning step and the inference step live in the same cell, training re-runs every time you infer, so that part has to be separated. So the finished product I planned was a Google Colab consisting of just the two cells below (a rough sketch follows the list).
- Learning cell
- Inference cell
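To make that concrete, here is a minimal, self-contained sketch of the two-cell pattern. It uses placeholder data and a stand-in LightGBM model rather than the actual Hee-AI code (which appears later in this article); the `#@param` comment is Colab's form syntax for showing a text box.

```python
# --- Cell 1: learning (run once; everything heavy happens here) ---
import numpy as np
import lightgbm as lgb

X = np.random.rand(100, 3)   # placeholder features
y = np.random.rand(100)      # placeholder normalized hee scores
model = lgb.train({'objective': 'regression', 'verbose': -1}, lgb.Dataset(X, y))

# --- Cell 2: inference (run as often as you like, no retraining) ---
# In Colab, the "#@param" comment turns this assignment into a text-box form field.
content = 'Some trivia sentence'  #@param {type:"string"}
# In the real notebook, `content` is featurized before predicting;
# here we use placeholder features to keep the sketch self-contained.
x_new = np.random.rand(1, 3)
print(int(model.predict(x_new)[0] * 100), 'hee')
```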
I also wanted to publish it with as little setup as possible. So rather than cloning a GitHub repository with a directory structure onto Google Colab, I made it self-contained except for the install steps: the Wikipedia data is stored directly in a Python dictionary, and a Python package called stickytape is used to bundle everything into a single Python file, among other tricks.
So it turned into a self-imposed restriction game of building an AI application on Colab with Python alone, and I decided to enter it in the currently running Qiita Summer Festival 2020.
The festival is soliciting articles on three themes. I started writing with theme 1 in mind, but while working on it I realized I could also touch on the machine-learning anti-patterns of theme 3, so I decided to cover that as well.
That was the backstory.
Let's go back to the production story.
First up: scraping. At first I copied the HTML and spent about 15 minutes struggling to format it automatically with vim macros, but there were outliers, and I decided that handling them programmatically was the better option, so I gave up on the macros.
The scraping code I actually wrote looks like this.
scraping.py
```python
import urllib.request

from bs4 import BeautifulSoup

URL = "https://ja.wikipedia.org/wiki/%E3%83%88%E3%83%AA%E3%83%93%E3%82%A2%E3%81%AE%E6%B3%89_%E3%80%9C%E7%B4%A0%E6%99%B4%E3%82%89%E3%81%97%E3%81%8D%E3%83%A0%E3%83%80%E7%9F%A5%E8%AD%98%E3%80%9C"


def get_text(tag):
    # Strip newlines and footnote markers such as [18] from the cell text.
    text = tag.text
    text = text.replace('\n', '')
    text = text.replace('[18]', '')
    text = text.replace('[19]', '')
    text = text.replace('[20]', '')
    text = text.replace('[21]', '')
    text = text.replace('[22]', '')
    return text


if __name__ == "__main__":
    html = urllib.request.urlopen(URL)
    soup = BeautifulSoup(html, 'html.parser')

    trivia_table = soup.find('table', attrs={'class': 'sortable'})

    trivias_list = []
    for i, line in enumerate(trivia_table.tbody):
        # Skip the header rows and blank lines.
        if i < 3:
            continue
        if line == '\n':
            continue

        id = line.find('th')
        content, hee, man_hee = line.find_all('td')
        id, content, hee, man_hee = map(get_text, [id, content, hee, man_hee])

        # Some rows have an unknown hee count ('?'); skip them.
        if hee == '?':
            continue

        trivias_list.append({'id': id, 'content': content, 'hee': int(hee), 'man_hee': int(man_hee)})

    print(trivias_list)
```
Roughly speaking, the flow is:

- Use BeautifulSoup's `soup.find()` to locate the table that holds all the trivia on the Wikipedia Trivia Fountain page.
- Skip unrelated rows with `continue`, reading the table row by row while handling the exceptional cases.
- Even the rows we want contain meaningless strings such as annotation markers, so clean them up with the `get_text` function.
- Store the id, the sentence, the hee count, and so on in `trivias_list`.
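For reference, each element of `trivias_list` ends up as a dictionary shaped like this (the values here are illustrative, not an actual row from the table):

```python
{'id': '1', 'content': 'Some trivia sentence', 'hee': 73, 'man_hee': 100}
```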
Now that we have all the data, it's time for a machine learning shop's specialty: feature engineering. I knew it before I even started, but I couldn't get much accuracy: there are only about 1000 sentences of roughly 20 characters each, so there's only so much you can do. Still, to lose as little as possible, I extracted the following features.
- Sentence length
- Word count
- Hiragana word ratio
- Katakana word ratio
- Kanji word ratio
- English word ratio

(plus a top-5 TF-IDF vector, as you can see in the code below)
The objective variable is `hee count / max hee`. If you used the raw hee count as-is, the scale would differ between broadcasts; at Trivia Fountain specials the maximum was sometimes 200 hee instead of the usual 100. So the objective variable is normalized to the 0-1 range as `hee count / max hee`; for example, 86 hee out of 100 and 172 hee out of 200 both become 0.86.
The code looks like this.
feature.py
```python
import re

import MeCab
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer

# MeCab tokenizer (NEologd dictionary)
tagger = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')
tagger.parse('')

# Character-type predicates
re_hira = re.compile(r'^[ぁ-ん]+$')
re_kata = re.compile(r'[\u30A1-\u30F4]+')
re_kanj = re.compile(r'^[\u4E00-\u9FD0]+$')
re_eigo = re.compile(r'^[a-zA-Z]+$')
is_hira = lambda word: not re_hira.fullmatch(word) is None
is_kata = lambda word: not re_kata.fullmatch(word) is None
is_eigo = lambda word: not re_eigo.fullmatch(word) is None
is_kanj = lambda word: not re_kanj.fullmatch(word) is None


# tl: trivias_list
def normalize_hee(tl):
    # Normalize each hee count to the 0-1 range.
    for i in range(len(tl)):
        tl[i]['norm_hee'] = tl[i]['hee'] / tl[i]['man_hee']
    return tl


def wakati(text):
    # Tokenize with MeCab and join the base forms (fall back to the surface form).
    node = tagger.parseToNode(text)
    l = []
    while node:
        if node.feature.split(',')[6] != '*':
            l.append(node.feature.split(',')[6])
        else:
            l.append(node.surface)
        node = node.next
    return ' '.join(l)


def preprocess(tl):
    tl = normalize_hee(tl)
    for i in tqdm(range(len(tl))):
        tl[i]['wakati_content'] = wakati(tl[i]['content'])
    return tl


def count_len(sentence):
    return len(sentence)


def count_word(sentence):
    return len(sentence.split(' '))


def count_kata(sentence):
    # Ratio of katakana words in the tokenized sentence.
    cnt = 0; total = 0
    for word in sentence.split(' '):
        if word == '': continue
        total += 1
        if is_kata(word): cnt += 1
    return cnt / total


def count_hira(sentence):
    # Ratio of hiragana words in the tokenized sentence.
    cnt = 0; total = 0
    for word in sentence.split(' '):
        if word == '': continue
        total += 1
        if is_hira(word): cnt += 1
    return cnt / total


def count_eigo(sentence):
    # Ratio of English words in the tokenized sentence.
    cnt = 0; total = 0
    for word in sentence.split(' '):
        if word == '': continue
        total += 1
        if is_eigo(word): cnt += 1
    return cnt / total


def count_kanj(sentence):
    # Ratio of kanji words in the tokenized sentence.
    cnt = 0; total = 0
    for word in sentence.split(' '):
        if word == '': continue
        total += 1
        if is_kanj(word): cnt += 1
    return cnt / total


def get_features(trivias_list, content=None, mode='learn'):
    trivias_list = preprocess(trivias_list)
    trivias_df = pd.DataFrame(trivias_list)

    wakati_contents_list = trivias_df['wakati_content'].values.tolist()

    # Fit the TF-IDF vocabulary on ALL scraped trivia.
    word_vectorizer = TfidfVectorizer(max_features=5)
    word_vectorizer.fit(wakati_contents_list)

    if mode == 'inference':
        # Featurize a single new trivia sentence using the same vocabulary.
        content = [{'content': content, 'wakati_content': wakati(content)}]
        content_df = pd.DataFrame(content)
        wakati_content_list = content_df['wakati_content'].values.tolist()
        tfidf = word_vectorizer.transform(wakati_content_list)
        content_df = pd.concat([
            content_df,
            pd.DataFrame(tfidf.toarray())
        ], axis=1)
        num_len_df = content_df['wakati_content'].map(count_len)
        num_word_df = content_df['wakati_content'].map(count_word)
        num_hira_df = content_df['wakati_content'].map(count_hira)
        num_kata_df = content_df['wakati_content'].map(count_kata)
        num_eigo_df = content_df['wakati_content'].map(count_eigo)
        num_kanj_df = content_df['wakati_content'].map(count_kanj)
        content_df['num_len'] = num_len_df.values.tolist()
        content_df['num_word'] = num_word_df.values.tolist()
        content_df['num_hira'] = num_hira_df.values.tolist()
        content_df['num_kata'] = num_kata_df.values.tolist()
        content_df['num_eigo'] = num_eigo_df.values.tolist()
        content_df['num_kanj'] = num_kanj_df.values.tolist()
        content_df = content_df.drop('content', axis=1)
        content_df = content_df.drop('wakati_content', axis=1)
        return content_df

    tfidf = word_vectorizer.transform(wakati_contents_list)
    all_df = pd.concat([
        trivias_df,
        pd.DataFrame(tfidf.toarray())
    ], axis=1)
    num_len_df = all_df['wakati_content'].map(count_len)
    num_word_df = all_df['wakati_content'].map(count_word)
    num_hira_df = all_df['wakati_content'].map(count_hira)
    num_kata_df = all_df['wakati_content'].map(count_kata)
    num_eigo_df = all_df['wakati_content'].map(count_eigo)
    num_kanj_df = all_df['wakati_content'].map(count_kanj)
    all_df['num_len'] = num_len_df.values.tolist()
    all_df['num_word'] = num_word_df.values.tolist()
    all_df['num_hira'] = num_hira_df.values.tolist()
    all_df['num_kata'] = num_kata_df.values.tolist()
    all_df['num_eigo'] = num_eigo_df.values.tolist()
    all_df['num_kanj'] = num_kanj_df.values.tolist()

    if mode == 'learn':
        # Drop everything that must not be fed to the model.
        all_df = all_df.drop('id', axis=1)
        all_df = all_df.drop('hee', axis=1)
        all_df = all_df.drop('man_hee', axis=1)
        all_df = all_df.drop('content', axis=1)
        all_df = all_df.drop('wakati_content', axis=1)

    return all_df
```
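For reference, `get_features` gets called in two ways: learning mode builds the full feature table (presumably inside `load_data`), and inference mode featurizes a single new sentence against the same TF-IDF vocabulary, as in the training script below. A minimal usage sketch, assuming `trivias_list` is the scraped list from earlier:

```python
# Learning mode: build the feature table for all scraped trivia.
all_df = get_features(trivias_list, mode='learn')

# Inference mode: featurize one new trivia sentence.
content_df = get_features(trivias_list, content='Some trivia sentence', mode='inference')
```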
I chose LightGBM as the model, honestly because anything else felt like a hassle. The thought process was roughly: "What else could I use? A neural net? Then I'd have to normalize the features... hmm, gradient boosting it is!" lol
That said, I did do the bare minimum properly:
- Evaluation metric: MSE
- Compute the MSE on 5 held-out folds with KFold
- Hyperparameter search with Optuna, using the mean of the 5 fold MSEs as the objective value
  - max_depth: 1-20
  - learning_rate: 0.001-0.1
  - num_leaves: 2-70
The final best MSE was around 0.014.
In other words, the average squared error per trivia is about 0.014; taking the square root gives about 0.118, and since the target was normalized, multiplying by 100 brings it back to the hee scale.
So the error is roughly 11.8 hee. (Hmm...)
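Spelled out as code, that back-of-the-envelope conversion is just:

```python
import math

mse = 0.014            # best mean squared error on the 0-1 normalized target
rmse = math.sqrt(mse)  # ≈ 0.118
print(rmse * 100)      # ≈ 11.8 hee on the original 0-100 scale
```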
The code looks like this.
train_lgb.py
```python
import optuna
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import KFold

from data.loader import load_data
from data.feature import get_features
from data.trivias_list import trivias_list


def objective(trial):
    max_depth = trial.suggest_int('max_depth', 1, 20)
    learning_rate = trial.suggest_uniform('learning_rate', 0.001, 0.1)
    params = {
        'metric': 'l2',
        'num_leaves': trial.suggest_int("num_leaves", 2, 70),
        'max_depth': max_depth,
        'learning_rate': learning_rate,
        'objective': 'regression',
        'verbose': 0
    }

    mse_list = []
    kfold = KFold(n_splits=5, shuffle=True, random_state=1)
    for train_idx, valid_idx in kfold.split(X, y):
        X_train = X.iloc[train_idx]
        y_train = y.iloc[train_idx]
        X_valid = X.iloc[valid_idx]
        y_valid = y.iloc[valid_idx]

        lgb_train = lgb.Dataset(X_train, y_train)
        lgb_valid = lgb.Dataset(X_valid, y_valid)

        model = lgb.train(params,
                          lgb_train,
                          valid_sets=lgb_valid,
                          verbose_eval=10,
                          early_stopping_rounds=30)

        # MSE on this fold
        pred_y_valid = model.predict(X_valid, num_iteration=model.best_iteration)
        true_y_valid = np.array(y_valid.tolist())
        mse = np.sum((pred_y_valid - true_y_valid)**2) / len(true_y_valid)
        mse_list.append(mse)

    # Optuna minimizes the mean MSE over the 5 folds.
    return np.mean(mse_list)


def build_model():
    study = optuna.create_study()
    study.optimize(objective, n_trials=500)

    # Retrain on all data with the best parameters,
    # keeping the last 20% as a validation set for early stopping.
    valid_split = 0.2
    num_train = int((1 - valid_split) * len(X))
    X_train = X[:num_train]   # (not used below; training uses the full dataset)
    y_train = y[:num_train]
    X_valid = X[num_train:]
    y_valid = y[num_train:]

    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_valid = lgb.Dataset(X_valid, y_valid)
    lgb_data = lgb.Dataset(X, y)

    params = study.best_params
    params['metric'] = 'l2'
    model = lgb.train(params,
                      lgb_data,
                      valid_sets=lgb_valid,
                      verbose_eval=10,
                      early_stopping_rounds=30)

    return model


X, y = load_data(trivias_list)

if __name__ == "__main__":
    model = build_model()

    content = 'The amount of honey a bee collects in its lifetime is about one teaspoon.'
    content_df = get_features(trivias_list, content=content, mode='inference')
    output = model.predict(content_df)
    hee = int(output[0] * 100)

    print(f"{content}")
    print(f"{hee} hee")
```
Now let me describe the anti-patterns I fell into while building the machine learning model. I hope they will be useful to you, and to my future self.
For the scaling of the objective variable, I used `hee count / max hee`. Fortunately I caught it right away, but at first I was fully prepared to use the raw hee count as the target, and I kept coding that way. Had I continued, I would probably have gotten no accuracy and given up with a "well, this problem is just too hard..."
Sorry, this one is a very minor point... but it is my second failure, so let me write it down.
To embed sentences with TF-IDF, you need to take the following three steps.
```python
# 1. Create the vectorizer instance.
word_vectorizer = TfidfVectorizer(max_features=max_features)
# 2. Fit it on the word-separated (wakati) list of ALL sentences.
word_vectorizer.fit(wakati_contents_list)
# 3. Transform the word-separated sentences you want to embed.
tfidf = word_vectorizer.transform(wakati_contents_list)
```
Because TF-IDF has to measure how rare each word is across the whole corpus, you must pass all of the sentences to the vectorizer instance once; you can't just hand it a single sentence and say "embed this".
However, I hadn't internalized that TF-IDF is a deterministic transformation that still needs to be fitted, and I kept trying to casually call transform without fitting...
Everyone should be careful.
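For what it's worth, scikit-learn makes this mistake hard to miss: calling `transform` on a vectorizer that hasn't been fitted raises an error. A small sketch with placeholder sentences:

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['some wakati sentence', 'another wakati sentence']  # placeholder corpus

vectorizer = TfidfVectorizer(max_features=5)
try:
    vectorizer.transform(docs)  # transform before fit
except NotFittedError as e:
    print('not fitted yet:', e)

vectorizer.fit(docs)                     # fit on the whole corpus first
print(vectorizer.transform(docs).shape)  # now it works
```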
I wondered whether to write this one because it is so basic, but the fact is it tripped me up for a minute, so here it is.
Yes, this part:
```python
all_df = all_df.drop('id', axis=1)
all_df = all_df.drop('hee', axis=1)
all_df = all_df.drop('man_hee', axis=1)
all_df = all_df.drop('content', axis=1)
all_df = all_df.drop('wakati_content', axis=1)
```
Columns that should not be fed into LightGBM, such as `id` and `content` (the original string), need to be properly dropped from the pandas DataFrame.
I think this mistake is surprisingly easy to make.
During feature generation you keep cheerfully appending columns to the DataFrame as shown below, so it's easy to forget that some of them aren't needed. (Maybe it's because I'm new to pandas; I've been too deep into deep learning.)
```python
all_df['num_eigo'] = num_eigo_df.values.tolist()
```
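Incidentally, pandas can drop all of those columns in a single call, which makes it a little harder to forget one:

```python
# Equivalent to the five separate drop() calls above.
all_df = all_df.drop(columns=['id', 'hee', 'man_hee', 'content', 'wakati_content'])
```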
Now, as mentioned at the beginning, the finished product is published on Google Colab.
I bundled it into a single Python file with stickytape and set up the UI so that it's easy to play with even on Google Colab. I wrote an article about stickytape just yesterday, so please have a look:
Make multiple python files into one python file @wataoka
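For reference, bundling with stickytape is essentially a one-line command. Here is a hypothetical invocation via Python's subprocess; the entry-point and output file names are placeholders, and the `--add-python-path` flag is as I recall from the stickytape README, so double-check against its docs.

```python
# Hypothetical sketch: bundle the project into one standalone Python file.
import subprocess

bundled = subprocess.run(
    ['stickytape', 'main.py', '--add-python-path', '.'],  # placeholder paths
    capture_output=True, text=True, check=True,
).stdout

with open('hee_ai_standalone.py', 'w') as f:
    f.write(bundled)
```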
Now that it's done, let me play with it myself. After training, I fed Hee-AI some trivia I happened to know and let it infer.
-"Toyotomi Hideyoshi is correct to read as Toyotomi Hideyoshi" ** 72 Hey **
Tough ...
-"Mr. Noppo has spoken." ** 81 Hey ** At the head family, it's 99 and it's the top in history ...
--"Sunny days are bright" ** 68 Hey ** Obviously, is it a little lower?
Overall, it's a tough evaluation. I never reached 90. Is 5 out of 5 Tamori-san?
Please let me know when anyone reaches 90.
Writing this at the beginning would only have gotten in the way, so let me quietly introduce myself at the end.
name | Aki Wataoka |
---|---|
school | Kobe University Graduate School |
undergraduate research | machine learning, speech processing |
graduate research | machine learning, fairness, generative models, etc. |
Twitter | @Wataoka_Koki |
Follow me on Twitter!