This article is a continuation of **(Note) A web application that uses TensorFlow to infer recommended song names [Creating an execution environment with docker-compose]**. Last time I built the TensorFlow and Flask environment with docker-compose, so this time I would like to organize the machine learning part using TensorFlow + Keras. Please note that this is an article I wrote for myself, so it may be hard to follow, and the information and technology may be out of date :bow: I also hope it will be helpful for those who want to build some kind of web application on their own.
The actual web application looks like the GIF below. When I typed a sentence into the search box, it answered with Humbert Humbert's "The Same Story" :clap: $\tiny{* Since there is little training data, only some songs will be hit. It's shabby}$ :bow_tone1: $\tiny{* Clicking the score link shows part of the score, but that is outside the scope of this article}$ :no_good_tone1:
I used the following as references when creating this article :bow_tone1:
Previous: **(Note) A web application that uses TensorFlow to infer recommended song names [Creating an execution environment with docker-compose]**. This time: **machine learning**.
Chapter | Category | Status | Contents | Language, FW, environment, etc. |
---|---|---|---|---|
Preface | Common | Done | App overview | Python TensorFlow Keras Google Colaboratory |
Chapter 1 | Web API | Done | Environment construction (execution environment) | docker-compose Flask Nginx gunicorn |
Chapter 2 | Web API | Done | (This time) Machine learning | Python TensorFlow Keras Flask |
Chapter 3 | Screen | Not started | Environment construction | Python Django Nginx gunicorn PostgreSQL virtualenv |
Chapter 4 | Screen | Not started | Display, Web API call part | Python Django |
Chapter 5 | AWS | Not started | AWS auto-deploy | GitHub EC2 CodeDeploy CodePipeline |
* I think it will work even with versions other than those below, but please note that they are old :no_good_tone2:
OS: Ubuntu 18.04.4 LTS
----------------------------------
Flask 1.1.0
gunicorn 19.9.0
Keras 2.3.1
Keras-Applications 1.0.8
Keras-Preprocessing 1.1.2
matplotlib 3.1.1
mecab-python3 0.996.2
numpy 1.16.4
pandas 0.24.2
Pillow 7.1.2
pip 20.1
Python 3.6.9
requests 2.22.0
scikit-learn 0.21.2
sklearn 0.0
tensorflow 2.2.0
First, here is the function I want to create: a Web API that, given a sentence (the atmosphere of a song, etc.), returns a recommended song title. The actual Web API looks like this.
In the example, the GET parameter is "A song that sadly wishes someone's happiness", and the song title "Kumo ga Yuku no wa" comes back as JSON. [(Example) Web API link](http://52.192.175.215:8888/recommend/api/what-music/ A song that sadly wishes someone's happiness)
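For reference, here is a minimal sketch of calling this Web API from Python with requests. The host and path come from the example link above; the exact JSON keys are an assumption, since the article only shows that the song title is returned:

```python
import requests

# Hypothetical client call to the Web API shown above.
# Only the song title in the response is documented; the key names are assumed.
url = ("http://52.192.175.215:8888/recommend/api/what-music/"
       "A song that sadly wishes someone's happiness")
response = requests.get(url)
print(response.json())  # JSON body containing the inferred song title
```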
The processing flow inside this Web API is as follows. As the flow shows, the song title is returned at the end, but in the middle the weight data is loaded. This is a pre-trained model created by machine learning, so let's walk through how to create that trained model.
The following flow, seen from the developer's point of view, leads up to machine learning. First, prepare the source training data; this is written as text that humans can understand. Next, preprocess it so that the machine (computer) can understand it: in this example, the training data is converted into numeric vectors by a method called TF-IDF. Finally, machine learning is performed with an MLP (multilayer perceptron). Each step is described in detail later.
The source training data is comma-separated, as shown below. Each record contains song information (atmosphere, artist name, etc.) for a song title to be inferred. Phrases within a field are separated by "|" (pipe), but it also works without it. A hypothetical record is sketched below.
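To make the format concrete, here is a minimal sketch of reading one such record with Python's csv module. The record content is hypothetical (the real file is not reproduced here); what matters is that the third column holds the song title used as the label and the fourth holds the "|"-separated description, matching how `read_file()` uses `row[2]` and `row[3]` later:

```python
import csv, io

# Hypothetical record in the assumed shape of ans_studyInput_fork.txt
sample = ("1,folk,Kumo ga Yuku no wa,"
          "movies|Tetsuya Takeda|painful|I wish the happiness of someone I don't know")

row = next(csv.reader(io.StringIO(sample)))
print(row[2])  # label (song title)
print(row[3])  # sentence that is fed to the tokenizer
```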
Now convert the text into numeric vectors with TF-IDF. First, load the training data created above, then split each sentence into words for the TF-IDF calculation (word segmentation). This step uses MeCab for morphological analysis. For reference, the word-segmentation source is as follows.
Below is the code to paste into Google Colaboratory. $\tiny{* Don't stare at it too hard}$ :no_good_tone1: Please paste and execute it in order from the top.
Install the required libraries
#Install the required libraries
!apt-get install mecab libmecab-dev mecab-ipadic-utf8
!pip3 install mecab-python3
Word segmentation (partial excerpt)

import MeCab

# Initialize MeCab
tagger = MeCab.Tagger()

def tokenize(text):
    '''Perform morphological analysis with MeCab'''
    result = []
    word_s = tagger.parse(text)
    for n in word_s.split("\n"):
        if n == 'EOS' or n == '': continue
        p = n.split("\t")[1].split(",")
        h, h2, org = (p[0], p[1], p[6])
        # Keep only nouns, verbs and adjectives (MeCab emits Japanese POS tags)
        if not (h in ['名詞', '動詞', '形容詞']): continue
        # Skip numeric nouns
        if h == '名詞' and h2 == '数': continue
        result.append(org)
    return result

# Module test (the original sentence is Japanese; shown here in translation)
if __name__ == '__main__':
    print(tokenize("movies|Tetsuya Takeda|painful|I wish the happiness of someone I don't know"))
When you run it, you should see something on the console like this:
['movies', '*', 'Takeda', 'Tetsuya', '*', 'painful', '*', 'know', 'who', 'happiness', 'Wish']
The sentence is now split into words. The example above is only one sentence; in the actual program, this process is repeated for every sentence (line) in the file, as sketched below.
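As a sketch, that per-line loop could look like the following, assuming one CSV record per line with the sentence in the fourth field, as in the training file shown earlier:

```python
import csv

# Tokenize every sentence (line) of the training file.
# Assumes the tokenize() function defined above and the file layout shown earlier.
with open("ans_studyInput_fork.txt", encoding="utf-8") as f:
    corpus = [tokenize(row[3]) for row in csv.reader(f)]
print(len(corpus), "documents tokenized")
```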
Once the text can be segmented, calculate TF-IDF. There was an easy-to-understand explanation of TF-IDF, so I will quote it. Source: TF-IDF
> A value used to extract the feature words of a document. Given several documents, it quantifies which words are important to each document, based on the words that appear in them and their frequencies.
TF-IDF is expressed by the following formula.
\textrm{TF-IDF}(t,d) = \textrm{tf}(t,d) \times \textrm{idf}(t)
Also, $\textrm{tf}(t,d)$ and $\textrm{idf}(t)$ are expressed by the following formulas.
\textrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{s \in d}n_{s,d}} \textrm{ , } \textrm{idf}(t) = \log{\frac{N}{df(t)}} + 1
$n_{t,d}$: number of occurrences of word $t$ in document $d$
$\sum_{s \in d} n_{s,d}$: total number of occurrences of all words in document $d$
$N$: total number of documents
$df(t)$: number of documents in which word $t$ appears
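As a quick worked example with made-up numbers: if a word appears 2 times in a 10-word document and appears in 3 of $N = 30$ documents, then

\textrm{tf} = \frac{2}{10} = 0.2 \textrm{ , } \textrm{idf} = \log{\frac{30}{3}} + 1 \approx 3.30 \textrm{ , } \textrm{TF-IDF} \approx 0.2 \times 3.30 = 0.66

(the program below additionally caps each value at 1.0).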
The above formula can be converted into a program as follows.
TF-IDF calculation (excerpt)

import numpy as np

def calc_files():
    '''Calculate TF-IDF vectors for the added files'''
    # 'files' (tokenized documents as lists of word IDs) and 'word_dic'
    # are module-level globals set up elsewhere in the module
    global dt_dic
    result = []
    doc_count = len(files)
    dt_dic = {}
    # Count how often each word appears
    for words in files:
        used_word = {}
        data = np.zeros(word_dic['_id'])
        for id in words:
            data[id] += 1
            used_word[id] = 1
        # If word t is used in this document, add it to dt_dic
        for id in used_word:
            if not(id in dt_dic): dt_dic[id] = 0
            dt_dic[id] += 1
        # Convert occurrence counts to ratios (the tf term)
        data = data / len(words)
        result.append(data)
    # Calculate TF-IDF
    for i, doc in enumerate(result):
        for id, v in enumerate(doc):
            idf = np.log(doc_count / dt_dic[id]) + 1
            # Cap each component at 1.0 so values stay in [0, 1]
            doc[id] = min([doc[id] * idf, 1.0])
        result[i] = doc
    return result
* This mostly adapts the sample source [^1] from this [reference book](https://www.amazon.co.jp/%E3%81%99%E3%81%90%E3%81%AB%E4%BD%BF%E3%81%88%E3%82%8B-%E6%A5%AD%E5%8B%99%E3%81%A7%E5%AE%9F%E8%B7%B5%E3%81%A7%E3%81%8D%E3%82%8B-Python%E3%81%AB%E3%82%88%E3%82%8B-AI%E3%83%BB%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%83%BB%E6%B7%B1%E5%B1%A4%E5%AD%A6%E7%BF%92%E3%82%A2%E3%83%97%E3%83%AA%E3%81%AE%E3%81%A4%E3%81%8F%E3%82%8A%E6%96%B9-%E3%82%AF%E3%82%B8%E3%83%A9%E9%A3%9B%E8%A1%8C%E6%9C%BA/dp/4802611641), but this time I am publishing the source on GitHub (source). [^1]: [Source: Ready to use! Can be practiced in business! Sample code for creating AI / machine learning / deep learning apps with Python](https://github.com/kujirahand/book-mlearn-gyomu)
The source that goes from reading the training file to calculating and outputting TF-IDF is as follows. Since the core TF-IDF calculation source is long, it is modularized and imported :sweat: It also reads the training data. Both files are stored below, so please upload them:

- tfidfWithIni.py ← module that calculates TF-IDF
- ans_studyInput_fork.txt ← training source file
Below is the code to paste into Google Colaboratory for your reference. $\tiny{* Don't stare at it too hard}$ :no_good_tone1: Please paste and execute it in order from the top.
Step 1: Upload the files

# Upload the files ("tfidfWithIni.py", "ans_studyInput_fork.txt")
from google.colab import files
uploaded = files.upload()
Step 2: Install the required libraries
#Create a directory for saving files
!mkdir text
#Install the required libraries
!apt-get install mecab libmecab-dev mecab-ipadic-utf8
!pip3 install mecab-python3
Step 3: Convert to TF-IDF vectors

import os, csv, glob, pickle
import tfidfWithIni
import importlib

# Reload the module (tfidfWithIni)
importlib.reload(tfidfWithIni)

# Initialize variables
y = []
x = []

# Dictionary for converting labels to codes
labelToCode = {}

# Read the csv file
def read_file(path):
    '''Add a text file for learning'''
    with open(path, "r", encoding="utf-8") as f:
        reader = csv.reader(f)
        label_id = 0
        for row in reader:
            # Create a label code the first time a song title appears
            if row[2] not in labelToCode:
                labelToCode[row[2]] = label_id
                label_id += 1
            y.append(labelToCode[row[2]])  # set the label
            tfidfWithIni.add_text(row[3])  # set the sentence
            # print("label: ", row[2], "(", labelToCode[row[2]], ")", "Sentence: ", row[3])

# Module test
if __name__ == '__main__':
    # Initialize the TF-IDF vectorizer (empty the file list)
    tfidfWithIni.iniForOri()
    # Read the training file
    read_file("ans_studyInput_fork.txt")
    # Convert to TF-IDF vectors
    x = tfidfWithIni.calc_files()
    # Save the results
    pickle.dump([y, x], open('text/genre.pickle', 'wb'))
    tfidfWithIni.save_dic('text/genre-tdidf.dic')
    pickle.dump(labelToCode, open('text/label_to_code.pickle', 'wb'))
When executed, the folder and file will be created as shown below.
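As an optional sanity check, you can reload what was just saved and look at the dimensions. This sketch only relies on the save format from the code above; the variable names are mine:

```python
# Optional: verify the saved TF-IDF data
y_chk, x_chk = pickle.load(open('text/genre.pickle', 'rb'))
print(len(y_chk), "labels,", len(x_chk), "vectors of", len(x_chk[0]), "dimensions")
```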
The dictionary for the TF-IDF calculation maps the words used in the calculation to IDs, as follows; a sketch of its structure is shown below.
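The actual file is not reproduced here, but from how the module is used later (`save_dic()`, `load_dic()`, and the `[0]['_id']` lookup in the inference code), its structure is roughly the following; the example words are hypothetical:

```python
# Assumed structure of the TF-IDF dictionary (hypothetical entries):
# each word maps to an integer ID, and '_id' holds the next free ID,
# i.e. the dimensionality of the TF-IDF vectors (144 in the sample data).
word_dic = {'_id': 144, 'movie': 0, 'happiness': 1, 'wish': 2}
print(word_dic['_id'])  # vector dimension
```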
With preprocessing done, we are ready for machine learning. Based on the training data prepared above, we train the model so that it can identify song titles correctly. MLP (multilayer perceptron) is used as the learning method. An MLP is a type of neural network modeled on human nerves; it seems to consist of three or more layers of nodes. The MLP learns from the training data (the correct answers) so that even when unknown data comes in (in this example, the atmosphere of a song), it can make the correct judgment (in this example, the song title). We use the machine learning framework TensorFlow + Keras to do this, and this time we will create a neural network with the following structure. $\tiny{* image}$ :sweat:
Modeling with TensorFlow + Keras to create this neural network looks like this [^ 1].
#Define MLP model structure
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(in_size,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(nb_classes, activation='softmax'))
The layers use what Keras calls Dense, with which every perceptron in one layer is connected to every perceptron in the next layer. The number of inputs, x1 through xt in the diagram, is defined by the argument input_shape: it is simply the number of distinct words produced by segmenting all the sentences, which is 144 (dimensions) in the sample training file. The outputs, y1 through yclass, correspond to the number of song titles in the training file and are specified by the argument nb_classes; there are 10 (songs) in the sample.
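If you want to check the resulting layer shapes and parameter counts, `model.summary()` prints them after the model is defined; with in_size = 144 and nb_classes = 10 as in the sample data, the first Dense layer has 144 × 512 + 512 = 74,240 parameters:

```python
# Print layer shapes and parameter counts for the model defined above
model.summary()
```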
Next, set up how training should be performed so that the model can judge correctly (compile). Based on the Keras documentation's multiclass classification example, we use RMSprop as the optimization algorithm and categorical_crossentropy as the loss function. $\tiny{* (rough image) loss function: an index for measuring how far off the learning is / optimization algorithm: the correction method for getting closer to the correct answer}$
#Compile the model
model.compile(
loss='categorical_crossentropy',
optimizer=RMSprop(),
metrics=['accuracy'])
Finally, the training execution part. Training is performed by the fit method: you train by giving NumPy arrays of inputs (song atmosphere, etc.) and outputs (song titles) to the fit method of the Sequential model.
hist = model.fit(x_train, y_train,
                 batch_size=16,  # number of samples processed at a time
                 epochs=150,     # roughly, how many passes over the training data
                 verbose=1,
                 validation_data=(x_test, y_test))
Below is the code to paste into Google Colaboratory for your reference.
After executing up to Step 3 of [the TF-IDF vector creation procedure above](#tf-idf-vector-creation-procedure), you should be able to perform machine learning as follows:
Step 4: Perform machine learning (MLP)

import pickle
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop
import matplotlib.pyplot as plt
import numpy as np
import h5py

# Number of labels to classify
labelToCode = pickle.load(open("text/label_to_code.pickle", "rb"))
nb_classes = len(labelToCode)

# Load the preprocessed data
data = pickle.load(open("text/genre.pickle", "rb"))
y = data[0]  # label codes
x = data[1]  # TF-IDF vectors

# Convert label data to one-hot vectors
y = keras.utils.np_utils.to_categorical(y, nb_classes)
in_size = x[0].shape[0]  # number of elements of input x[0]

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    np.array(x), np.array(y), test_size=0.2)

# Define the MLP model structure
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(in_size,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(nb_classes, activation='softmax'))

# Compile the model
model.compile(
    loss='categorical_crossentropy',
    optimizer=RMSprop(),
    metrics=['accuracy'])

# Perform training
hist = model.fit(x_train, y_train,
                 batch_size=16,  # number of samples processed at a time
                 epochs=150,     # roughly, how many passes over the training data
                 verbose=1,
                 validation_data=(x_test, y_test))

# Evaluate
score = model.evaluate(x_test, y_test, verbose=1)
print("Correct answer rate=", score[1], 'loss=', score[0])

# Save the weight data
model.save_weights('./text/genre-model.hdf5')

# Plot the training progress (training and validation accuracy)
plt.plot(hist.history['accuracy'])
plt.plot(hist.history['val_accuracy'])
plt.title('Accuracy')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
When execution finishes, the following graph is displayed, and the file (/content/text/genre-model.hdf5) should have been created as well. This is the end of machine learning.
In the inference part, we define the same model as in training and load the trained weights, the TF-IDF dictionary, and the result label dictionary. An unknown document (song atmosphere) is then converted into a TF-IDF vector. Finally, giving that TF-IDF vector to the predict method of the Sequential model infers the song title.
Below is the code to paste into Google Colaboratory for your reference.
After executing up to Step 4 of [performing machine learning above](#performing-machine-learning), you should be able to infer the song title as follows:
Inferring song titles

import pickle, tfidfWithIni
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop
from keras.models import model_from_json
import importlib

# Reload the module (tfidfWithIni)
importlib.reload(tfidfWithIni)

def inverse_dict(d):
    return {v: k for k, v in d.items()}

# Judge a specified text
def getMusicName(text):
    # Convert to a TF-IDF vector
    data = tfidfWithIni.calc_text(text)
    # Predict with the MLP
    pre = model.predict(np.array([data]))[0]
    n = pre.argmax()
    print("Recommended song name: " + label_dic[n], "(", pre[n], ")")

# Label definitions
labelToCode = pickle.load(open("text/label_to_code.pickle", "rb"))
nb_classes = len(labelToCode)
label_dic = inverse_dict(labelToCode)

# Get the number of input elements from the dictionary
in_size_hantei = pickle.load(open("text/genre-tdidf.dic", "rb"))[0]['_id']

# Load the TF-IDF dictionary
tfidfWithIni.load_dic("text/genre-tdidf.dic")

# Define the Keras model and load the weight data
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(in_size_hantei,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(nb_classes, activation='softmax'))
model.compile(
    loss='categorical_crossentropy',
    optimizer=RMSprop(),
    metrics=['accuracy'])
model.load_weights('./text/genre-model.hdf5')

if __name__ == '__main__':
    requestParam = """
    A song that is sad and wishes for someone's happiness
    """
    getMusicName(requestParam)
The result may vary depending on how training went, but something like the following is displayed:
Recommended song name: Kumo ga Yuku no wa ( 0.99969995 )
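If you want more than the single best guess, a small variation of getMusicName (a sketch using the same `model`, `tfidfWithIni`, and `label_dic` as above; the helper name is mine) prints the top three candidates:

```python
import numpy as np

def top_k_music(text, k=3):
    '''Print the k most likely song titles for a sentence (sketch).'''
    data = tfidfWithIni.calc_text(text)
    pre = model.predict(np.array([data]))[0]
    # Indices of the k highest probabilities, in descending order
    for i in np.argsort(pre)[::-1][:k]:
        print(label_dic[i], "(", pre[i], ")")

top_k_music("A song that is sad and wishes for someone's happiness")
```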
Serving the song-title inference as a Web API with Flask was touched on a little in the previous article, so I will omit it here :sweat:
This time, I was able to organize my notes on machine learning a little. I hope to brush them up bit by bit when I have time :sob: It is still undecided, but next time I would like to cover the environment construction for the screen side.
Chapter | Category | Status | Contents | Language, FW, environment, etc. |
---|---|---|---|---|
Preface | Common | Done | App overview | Python TensorFlow Keras Google Colaboratory |
Chapter 1 | Web API | Done | Environment construction (execution environment) | docker-compose Flask Nginx gunicorn |
Chapter 2 | Web API | Done | Machine learning | Python TensorFlow Keras Flask |
Chapter 3 | Screen | Not started | (Next time) Environment construction | Python Django Nginx gunicorn PostgreSQL virtualenv |
Chapter 4 | Screen | Not started | Display, Web API call part | Python Django |
Chapter 5 | AWS | Not started | AWS auto-deploy | GitHub EC2 CodeDeploy CodePipeline |