This article is day 18 of the Retty Advent Calendar. Yesterday's entry was @YutaSakata's "I Want Kotlin 1.1 for Christmas."
By the way, Christmas is almost here. Do you have someone to spend it with? Me? Of course I do. This one right here.
When you're on your own, you turn to drink, right? Having a quiet drink at a good bar is nice, too. But what if the place you walked into with that in mind turned out to be a den of normies, packed wall to wall with happy couples? Your precious solitary gourmet time would be ruined.
Let's use the power of deep learning to steer clear of such dangerous places in advance.
Keras is a deep learning library that runs on either TensorFlow or Theano as its backend. It gets painful when you try to do complicated things, but most models can be written very concisely. That's what I'll use this time.
(2017/3/1 postscript) I use TensorFlow as the backend. With Theano, the code below won't work as-is because of a difference in how CNN channels are ordered. A small fix will get it running; see the comments section for details.
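For reference, the gist of that difference (my own sketch, not the commenters' exact patch): TensorFlow puts the channel axis last, Theano puts it first, so the Reshape that feeds Convolution2D has to change, and the shapes seen by the layers after it shift accordingly.

```python
# TensorFlow dim ordering (assumed throughout this article): channels last
#   emb_ex = Reshape((max_length, embed_size, 1))(emb)
# Theano dim ordering: channels first; the shapes seen by Convolution2D,
# MaxPooling2D, and the merge/Reshape that follow shift accordingly
#   emb_ex = Reshape((1, max_length, embed_size))(emb)
```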
For restaurant reviews, we use Retty reviews. Not having to crawl for data is one of the perks of being on the inside.
Since we want to sort restaurants into normie dens and everything else, let's build a classifier with deep learning. The flow is:

1. Collect Retty reviews for restaurants known to be date spots and for other restaurants
2. Train a classifier on those reviews
3. Run a restaurant's reviews through the classifier to judge how dangerous it is

That's the idea.
There are various ways to build a classifier, but this time we'll use a **character-level CNN**.
character-level CNN
When it comes to deep learning for natural language processing, LSTMs get most of the attention, but I won't use one this time; I'll use a CNN. Character-level CNNs have a very nice property: **no tokenization required**. Because a character-level CNN operates character by character rather than word by word, you don't have to split sentences into words. The outline of the method is as follows.

1. Split the sentence into characters
2. Convert each character to its Unicode code point
3. Embed each character into a fixed-size vector
4. Apply convolutions with several kernel sizes to the sequence of embedded characters
5. Max-pool each convolution's output and concatenate the results
6. Classify through fully connected layers
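To make steps 1 and 2 concrete: a review simply becomes a sequence of integers, one per character. This snippet is just an illustration; it mirrors what the load_data function below does (Python 2, like the rest of the code in this article).

```python
# -*- coding: utf-8 -*-
# Steps 1-2: split into characters and convert each to its Unicode code point
text = u"焼き鳥とワインが美味しい"  # "the yakitori and wine are delicious"
codes = [ord(ch) for ch in text]
print codes  # [28988, 12365, 40165, 12392, ...]
```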
From here on, I'll walk through a concrete implementation.
First, the character-level CNN model itself. It's super simple.
```python
# -*- coding: utf-8 -*-
from keras.layers import (Input, Dense, Dropout, Reshape, Embedding,
                          Convolution2D, MaxPooling2D, BatchNormalization, merge)
from keras.models import Model


def create_model(embed_size=128, max_length=300, filter_sizes=(2, 3, 4, 5), filter_num=64):
    inp = Input(shape=(max_length,))
    # Embed each Unicode code point into an embed_size-dimensional vector
    emb = Embedding(0xffff, embed_size)(inp)
    # Add a channel axis so Convolution2D can be applied
    emb_ex = Reshape((max_length, embed_size, 1))(emb)
    convs = []
    # Apply Convolution2D with several kernel sizes
    for filter_size in filter_sizes:
        conv = Convolution2D(filter_num, filter_size, embed_size, activation="relu")(emb_ex)
        # Max-pool over every window position, keeping the strongest response per filter
        pool = MaxPooling2D(pool_size=(max_length - filter_size + 1, 1))(conv)
        convs.append(pool)
    # Concatenate the pooled features from all kernel sizes
    convs_merged = merge(convs, mode='concat')
    reshape = Reshape((filter_num * len(filter_sizes),))(convs_merged)
    fc1 = Dense(64, activation="relu")(reshape)
    bn1 = BatchNormalization()(fc1)
    do1 = Dropout(0.5)(bn1)
    fc2 = Dense(1, activation='sigmoid')(do1)
    model = Model(input=inp, output=fc2)
    return model
```
Let me say a little more about step 4, the convolutions. The signature of Convolution2D is as follows.
```python
keras.layers.convolutional.Convolution2D(nb_filter, nb_row, nb_col, init='glorot_uniform', activation='linear', weights=None, border_mode='valid', subsample=(1, 1), dim_ordering='default', W_regularizer=None, b_regularizer=None, activity_regularizer=None, W_constraint=None, b_constraint=None, bias=True)
```
Here I pass 2, 3, 4, and 5 as nb_row and embed_size as nb_col. In other words, each kernel spans the full embedding width and covers 2, 3, 4, or 5 characters at a time, a bit like extracting 2-gram, 3-gram, 4-gram, and 5-gram features. Concatenating the results into one vector lets the model use several n-gram sizes at once.
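To make the shapes concrete, here is how a single input flows through the model with the default arguments above (assuming the TensorFlow channel ordering mentioned in the postscript):

```python
# Shape walkthrough for create_model() defaults
# (max_length=300, embed_size=128, filter_num=64, 4 kernel sizes):
#
# Input            (batch, 300)           one Unicode code point per character
# Embedding        (batch, 300, 128)      each code point becomes a 128-dim vector
# Reshape          (batch, 300, 128, 1)   add a channel axis for Convolution2D
# Convolution2D    (batch, 299, 1, 64)    filter_size=2: 300 - 2 + 1 = 299 windows
# MaxPooling2D     (batch, 1, 1, 64)      strongest response per filter
# merge (concat)   (batch, 1, 1, 256)     4 kernel sizes x 64 filters
# Reshape          (batch, 256)           flattened feature vector
# Dense -> sigmoid (batch, 1)             normie probability
model = create_model()
model.summary()  # prints the same shapes
```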
The data loading part could be made memory-friendly with a generator, but these are reviews, not images; they don't eat that much memory, so let's just load everything into memory. (For reference, a generator version is sketched right after the code.)
```python
# -*- coding: utf-8 -*-
import random


def load_data(filepath, targets, max_length=300, min_length=10):
    comments = []
    tmp_comments = []
    with open(filepath) as f:
        for l in f:
            # Assumes each line is a restaurant ID and a review, separated by a tab
            restaurant_id, comment = l.split("\t", 1)
            restaurant_id = int(restaurant_id)
            # Convert each character to its Unicode code point
            comment = [ord(x) for x in comment.strip().decode("utf-8")]
            # Truncate reviews that are too long
            comment = comment[:max_length]
            comment_len = len(comment)
            if comment_len < min_length:
                # Skip reviews that are too short
                continue
            if comment_len < max_length:
                # Pad with zeros so every review has a fixed length
                comment += ([0] * (max_length - comment_len))
            if restaurant_id not in targets:
                tmp_comments.append((0, comment))
            else:
                comments.append((1, comment))
    # For training, it's better to have equal numbers of date-spot reviews and other reviews
    random.shuffle(tmp_comments)
    comments.extend(tmp_comments[:len(comments)])
    random.shuffle(comments)
    return comments
```
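And for the record, if memory ever did become a problem, the same data could be fed in batches through a generator and model.fit_generator. A minimal sketch (the function name is my own; this is not what I actually ran):

```python
import random

import numpy as np


def generate_batches(comments, batch_size=100):
    # Endlessly yield (inputs, targets) batches from the (label, code points)
    # pairs produced by load_data, for use with model.fit_generator
    while True:
        random.shuffle(comments)
        for i in range(0, len(comments) - batch_size + 1, batch_size):
            batch = comments[i:i + batch_size]
            inputs = np.array([comment for _, comment in batch])
            targets = np.array([label for label, _ in batch])
            yield (inputs, targets)
```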
Now, let's train.
```python
# -*- coding: utf-8 -*-
import numpy as np
from keras.callbacks import LearningRateScheduler
from keras.optimizers import Adam


def train(inputs, targets, batch_size=100, epoch_count=100, max_length=300, model_filepath="model.h5", learning_rate=0.001):
    # Decrease the learning rate linearly, little by little
    start = learning_rate
    stop = learning_rate * 0.01
    learning_rates = np.linspace(start, stop, epoch_count)
    # Build the model
    model = create_model(max_length=max_length)
    optimizer = Adam(lr=learning_rate)
    model.compile(loss='binary_crossentropy',
                  optimizer=optimizer,
                  metrics=['accuracy'])
    # Train
    model.fit(inputs, targets,
              nb_epoch=epoch_count,
              batch_size=batch_size,
              verbose=1,
              validation_split=0.1,
              shuffle=True,
              callbacks=[
                  LearningRateScheduler(lambda epoch: learning_rates[epoch]),
              ])
    # Save the model
    model.save(model_filepath)


if __name__ == "__main__":
    comments = load_data(..., ...)
    input_values = []
    target_values = []
    for target_value, input_value in comments:
        input_values.append(input_value)
        target_values.append(target_value)
    input_values = np.array(input_values)
    target_values = np.array(target_values)
    train(input_values, target_values, epoch_count=50)
```
When I ran this, accuracy was over 99% on the training data but just under 80% on the test data.
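If you want to measure that test accuracy yourself, a minimal sketch (the file name here is hypothetical; it assumes a held-out file prepared exactly like the training data):

```python
import numpy as np
from keras.models import load_model

targets = set()  # put the same date-spot restaurant IDs used for training here
# "test_comments.tsv" is a hypothetical held-out file in the same format
comments = load_data("test_comments.tsv", targets)
inputs = np.array([comment for _, comment in comments])
labels = np.array([label for label, _ in comments])
model = load_model("model.h5")
loss, accuracy = model.evaluate(inputs, labels, verbose=0)
print "test accuracy: {}".format(accuracy)
```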
Now that we've come this far, we can score any review.
```python
# -*- coding: utf-8 -*-
import numpy as np
from keras.models import load_model


def predict(comments, model_filepath="model.h5"):
    model = load_model(model_filepath)
    ret = model.predict(comments)
    return ret


if __name__ == "__main__":
    raw_comment = "Great for dates!"
    # Preprocess exactly like the training data: code points, truncate, pad
    comment = [ord(x) for x in raw_comment.strip().decode("utf-8")]
    comment = comment[:300]
    if len(comment) < 10:
        exit("too short!!")
    if len(comment) < 300:
        comment += ([0] * (300 - len(comment)))
    ret = predict(np.array([comment]))
    predict_result = ret[0][0]
    print "Normie score: {}%".format(predict_result * 100)
```
A yakitori and wine bar in Musashi-Koyama. Everything was delicious! The prices aren't cheap, but all the wines are bio wines. They say the yakitori course is the way to go, so try that. The staff's service is also top-notch, so do pay a visit.
Running the model on the review above gave 99.9996066093%. Yakitori or not, you can't hide that normie aura. Incidentally, this review was written by Retty's founder, Takeda. There's no way he went to such a glittering place alone. Who did you go with?
I picked it because it's directly connected to the station!! I ordered yakitori, mizutaki, and more, and the chicken was plump and delicious!! It seems to fill up with people fresh off work. The prices were reasonable, though, and it was quite good ♫
This one scored 2.91604362879e-07%. Same yakitori, but when the vibe is after-work crowd rather than date night, the normie score drops this far. Your heart can stay calm. This review was also written by a Retty employee, but let's leave their name out of it.
Once you have this, you can judge whether a place is a normie den by running all of its reviews through the model and averaging their normie scores.
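As a sketch of that last step (the helper name is my own; it assumes the reviews are preprocessed exactly like in the prediction script above):

```python
import numpy as np


def restaurant_normie_score(preprocessed_comments, model):
    # Average the per-review scores for one restaurant.
    # Close to 1.0 means a normie den; close to 0.0 means you're safe.
    scores = model.predict(np.array(preprocessed_comments))
    return float(np.mean(scores))
```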
Deep learning turns out to be an excellent technique for protecting peace of mind, too. Character-level CNNs appeared around the beginning of this year, and more recently things like [QRNN](http://metamind.io/research/new-neural-network-building-block-allows-faster-and-more-accurate-text-understanding/) have come out, so I'd like to try those as well.
Have a nice Christmas.