This is the "Yu-Gi-Oh! DS (Data Science)" series that analyzes various Yu-Gi-Oh! Card data using Python. The article will be published four times in total, and finally we will implement a program that predicts offensive and defensive attributes from card names by natural language processing + machine learning. In addition, the author's knowledge of Yu-Gi-Oh has stopped at around E ・ HERO. I'm sorry that both cards and data science are amateurs, but please keep in touch.
No. | Article title | Keyword | |
---|---|---|---|
0 | Get card information from the Yu-Gi-Oh! database - Yugioh DS 0. Scraping | beautifulsoup | |
1 | Visualize Yu-Gi-Oh! card data in Python - Yugioh Data Science 1. EDA edition | pandas, seaborn | |
2 | Process Yu-Gi-Oh! card names with natural language processing - Yugioh DS 2. NLP edition | wordcloud, word2vec, doc2vec, t-SNE | This article! |
3 | Predict offensive and defensive attributes from the Yu-Gi-Oh! card name - Yugioh DS 3. Machine learning edition | lightgbm, etc. | |
This article digs deeper into the card names, which were not the focus of 1. EDA edition. All sorts of monsters appear in Yu-Gi-Oh! (dragons, wizards, HEROes, and so on), and we will explore which words are used most often in their names. We will also look at what the names have in common when the cards are split by attribute, type, and level.
The technical themes of this article are morphological analysis with MeCab, frequent-word visualization with WordCloud, distributed representations of words with Word2Vec and Doc2Vec, and dimensionality reduction plus word mapping with t-SNE. Each step is explained together with its implementation code.
Python==3.7.4
The data used in this article was scraped from the Yu-Gi-Oh! OCG Card Database with hand-made code, and is current as of June 2020. Several data frames are used depending on the graph being drawn, but all of them hold the following columns.
No. | Column name | Meaning | Sample | Supplement |
---|---|---|---|---|
1 | name | Card name | Ojama Yellow | |
2 | kana | Reading of the card name | Ojama Yellow | |
3 | rarity | Rarity | Normal | For convenience of acquisition, information such as "Limited" and "Forbidden" is also included |
4 | attr | Attribute | Light attribute | For non-monster cards, "Magic" or "Trap" is entered |
5 | effect | Effect | NaN | Holds the magic/trap card types such as "Continuous" and "Equip"; NaN for monsters |
6 | level | Level | 2 | "Rank 2" is entered for rank (Xyz) monsters |
7 | species | Type (race) | Beast | |
8 | attack | Attack | 0 | |
9 | defence | Defense | 1000 | |
10 | text | Card text | A member of the Ojama Trio, said to get in the way by any means. Something is said to happen when all three are together... | |
11 | pack | Pack name | EXPERT EDITION Volume 2 | |
12 | kind | Kind | - | For monster cards, information such as Fusion or Ritual is entered |
All of the analysis is intended to be run in an interactive environment such as Jupyter Lab.
Import the required packages. `MeCab`, `gensim`, and `wordcloud` are not included in Anaconda by default, so `pip install` them if necessary.
python
import matplotlib.pyplot as plt
import MeCab
import numpy as np
import pandas as pd
import re
import seaborn as sns
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from gensim.models import word2vec
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from PIL import Image
from wordcloud import WordCloud
%matplotlib inline
sns.set(font="IPAexGothic") # Use a Japanese-capable font for plots
How to obtain each dataset is described in 0. Scraping edition (not yet published as of June 2020).
python
#Not used this time
# all_data = pd.read_csv("./input/all_data.csv") #Data set for all cards (cards with the same name have duplicate recording packs)
# print("all_data: {}rows".format(all_data.shape[0]))
cardlist = pd.read_csv("./input/cardlist.csv") #All card dataset (no duplication)
print("cardlist: {}rows".format(cardlist.shape[0]))
#Not used this time
# monsters = pd.read_csv("./input/monsters.csv") #Monster card only
# print("monsters: {}rows".format(monsters.shape[0]))
monsters_norank = pd.read_csv("./input/monsters_norank.csv") #Remove rank monsters from monster cards
print("monsters_norank: {}rows".format(monsters_norank.shape[0]))
cardlist: 10410rows
monsters_norank: 6206rows
The procedure for using MeCab is roughly the following two steps.

1. Instantiate a `MeCab.Tagger`.
2. Call `parseToNode()`, which performs the morphological analysis, and store the result in a `node` object.

As a result, the `node` object holds two attributes:

- **surface**: the word itself, in the form it appears in the text
- **feature**: a comma-separated list of information about the word
python
# 1. Instantiate the morphological analyzer and store the result of the parseToNode method in an object
text = "Blue-Eyes White Dragon"
mecabTagger = MeCab.Tagger("-Ochasen -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/") # Use the mecab-ipadic-neologd dictionary
node = mecabTagger.parseToNode(text)
# 2. Create a data frame to store surface forms (surface) and features (feature)
surface_and_feature = pd.DataFrame()
surface = []
feature = []
# 3. Extract surface form and feature from the node object's attributes
while node:
    surface.append(node.surface)
    feature.append(node.feature)
    node = node.next
surface_and_feature['surface'] = surface
surface_and_feature['feature'] = feature
surface_and_feature
Since `feature` holds a comma-separated list, we convert it into a data frame as well.
With the `mecab-ipadic-neologd` dictionary, `feature` stores the following items in order: **part of speech (pos), part-of-speech subclassification 1 (pos1), subclassification 2 (pos2), subclassification 3 (pos3), conjugation type (ctype), conjugated form (cform), base form, reading, and pronunciation**.
The `BOS/EOS` rows at the top and bottom of the data frame simply mark the beginning and end of `node`.
python
text = "Blue-Eyes White Dragon"
mecabTagger = MeCab.Tagger("-Ochasen -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/")
node = mecabTagger.parseToNode(text)
# Put the contents of feature (part of speech, subclassifications 1-3, conjugation type, conjugated form, base form, reading, pronunciation) into a data frame
features = pd.DataFrame(columns=["pos","pos1","pos2","pos3","ctype","cform","base","read","pronounce"])
while node:
    tmp = pd.Series(node.feature.split(','), index=features.columns)
    features = features.append(tmp, ignore_index=True)
    node = node.next
features
Now we apply the MeCab morphological analyzer to the loaded card data.
Create a function `get_word_list` that breaks the list of card names down into words.
Particles such as "to" and "mo" would only add noise, so we keep **nouns, verbs, and adjectives** only.
python
def get_word_list(text_list):
    m = MeCab.Tagger("-Ochasen -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/")
    lines = []
    for text in text_list:
        keitaiso = []
        m.parse('')
        node = m.parseToNode(text)
        while node:
            # Store each morpheme as a dictionary
            tmp = {}
            tmp['surface'] = node.surface
            tmp['base'] = node.feature.split(',')[-3]  # base form
            tmp['pos'] = node.feature.split(',')[0]    # part of speech
            tmp['pos1'] = node.feature.split(',')[1]   # part-of-speech subclassification 1
            # Skip BOS/EOS, which only mark the beginning and end of a sentence
            if 'BOS/EOS' not in tmp['pos']:
                keitaiso.append(tmp)
            node = node.next
        lines.append(keitaiso)
    # Keep the surface form for nouns and the base form for verbs / adjectives
    # (MeCab returns Japanese part-of-speech tags: '名詞' = noun, '動詞' = verb, '形容詞' = adjective)
    word_list = []
    for line in lines:
        for keitaiso in line:
            if keitaiso['pos'] == '名詞':
                word_list.append(keitaiso['surface'])
            elif (keitaiso['pos'] == '動詞') | (keitaiso['pos'] == '形容詞'):
                if not keitaiso['base'] == '*':
                    word_list.append(keitaiso['base'])
                else:
                    word_list.append(keitaiso['surface'])
            # Uncomment to also keep words other than nouns, verbs, and adjectives
            # else:
            #     word_list.append(keitaiso['surface'])
    return word_list
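As a quick sanity check, the function can be run on a couple of names (a minimal sketch; the exact tokens depend on your MeCab dictionary, and these input names are just examples):
python
# Minimal usage sketch: output tokens depend on the installed dictionary (mecab-ipadic-neologd assumed)
sample_names = ["Blue-Eyes White Dragon", "Dark Magician"]  # example inputs
print(get_word_list(sample_names))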
Create two data frames for use in the visualization and modeling steps that follow.

- **`cardlist_word_count`**: built from `cardlist`, the deduplicated dataset of all cards. Its columns are `word`, a word used in card names, and `word_count`, the number of times that word appears.
- **`monsters_words`**: built from `monsters_norank`, the monster dataset with rank monsters removed. Its columns are `word` plus the features `name`, `level`, `attr`, `rarity`, `species`, and `kind` of the card in which the word appears. Note that each row is a word, not a card.
Incidentally, many Yu-Gi-Oh! card names are separated into words by the symbol "・", but MeCab does not split on this symbol. Therefore, before running the function above, we split the names on "・" in advance.
cardlist_word_count
python
#"・" Creates a pre-separated list namelist
namelist = []
for name in cardlist.name.to_list():
for name_ in name.split("・"):
namelist.append(name_)
#Function get_word_String list word by list_generate list
word_list = get_word_list(namelist)
# word_Data frame words that map words and their frequency of occurrence from list_Generation of df
word_freq = pd.Series(word_list).value_counts()
cardlist_word_count = pd.DataFrame({'word' : word_freq.index,
'word_count' : word_freq.tolist()})
cardlist_word_count
monsters_words
python
monsters_words= pd.DataFrame(columns=["word","name","level","attr","rarity","species","kind"])
for i, name in enumerate(monsters_norank.name.to_list()):
    words = get_word_list(name.split("・"))
    names = [monsters_norank.loc[i, "name"] for j in words]
    levels = [monsters_norank.loc[i, "level"] for j in words]
    attrs = [monsters_norank.loc[i, "attr"] for j in words]
    rarities = [monsters_norank.loc[i, "rarity"] for j in words]
    species = [monsters_norank.loc[i, "species"] for j in words]
    kinds = [monsters_norank.loc[i, "kind"] for j in words]
    tmp = pd.DataFrame({"word" : words, "name" : names, "level" : levels, "attr" : attrs, "rarity" : rarities, "species" : species, "kind" : kinds})
    monsters_words = pd.concat([monsters_words, tmp])
monsters_words
From `cardlist_word_count`, take the 50 most frequent words across all cards and rank them.
"Dragon" is the overwhelming number one with 326 occurrences. Together with the similar dragon words in 3rd and 98th place (written with different characters in the original Japanese names), dragon words appear 610 times in total.
python
df4visual = cardlist_word_count.head(50)
f, ax = plt.subplots(figsize=(20, 10))
ax = sns.barplot(data=df4visual, x="word", y="word_count")
ax.set_ylabel("frequency")
ax.set_title("Word ranking used in all cards")
for i, patch in enumerate(ax.patches):
    ax.text(i, patch.get_height()/2, int(patch.get_height()), ha='center')
plt.xticks(rotation=90)
plt.savefig('./output/nlp5-1.png', bbox_inches='tight', pad_inches=0)
While making the ranking I got curious about how "dragon" and "Dragon" (two distinct dragon words in the original Japanese names) are used differently, so let's take a short detour. Put level on the x-axis and draw the kernel density estimate for each of the two words. Each curve is drawn so that its total area is 1, so a peak means many monsters are concentrated around that level. The "Dragon" curve peaks further to the right than "dragon", so it tends to be used for relatively high-level, strong cards.
python
monsters_words_dragon = monsters_words.query("word == 'dragon' | word == 'Dragon'")
df4visual = monsters_words_dragon
f, ax = plt.subplots(figsize = (20, 5))
ax = sns.kdeplot(df4visual.query("word == 'dragon'").level, label="dragon")
ax = sns.kdeplot(df4visual.query("word == 'Dragon'").level, label="Dragon")
ax.set_xlim([0, 12]);
ax.set_title("Dragon/Dragon kernel distribution")
ax.set_xlabel("level")
plt.savefig('./output/nlp5-1a.png', bbox_inches='tight', pad_inches=0)
The source code and interpretation are omitted here, but countplot results by level and by attribute are shown as well.
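The plotting code was omitted in the original, but a rough sketch of the by-level version could look like this (using the `monsters_words_dragon` frame built above; the output filename is hypothetical):
python
# Sketch of the omitted countplot: occurrences of the two dragon words per level
f, ax = plt.subplots(figsize=(20, 5))
ax = sns.countplot(data=monsters_words_dragon, x="level", hue="word")
ax.set_title("Dragon word counts by level")
plt.savefig('./output/nlp5-1b.png', bbox_inches='tight', pad_inches=0)  # hypothetical filename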
5-2. WordCloud
`WordCloud` is a library for word visualization: it picks out frequently appearing words and draws more frequent words in a larger size. `wordcloud.generate_from_frequencies()` takes a dictionary of words and their frequencies and builds the word cloud from it.
Looking at the figure, "Dragon" is plotted in the largest size, just as in 5-1.
python
def make_wordcloud(df, col_name_noun, col_name_quant):
    # Build a dictionary of words and their frequencies from the data frame
    word_freq_dict = {}
    for i, v in df.iterrows():
        word_freq_dict[v[col_name_noun]] = v[col_name_quant]
    fpath = "/System/Library/Fonts/ヒラギノ角ゴシック W3.ttc"  # Japanese font; adjust the path to your environment
    # Instantiate WordCloud
    wordcloud = WordCloud(background_color='white',
                          font_path=fpath,
                          min_font_size=10,
                          max_font_size=200,
                          width=2000,
                          height=500
                          )
    wordcloud.generate_from_frequencies(word_freq_dict)
    return wordcloud
f, ax = plt.subplots(figsize=(20, 5))
ax.imshow(make_wordcloud(cardlist_word_count, 'word', 'word_count'))
ax.axis("off")
ax.set_title("All cards WordCloud")
plt.savefig('./output/nlp5-2a.png', bbox_inches='tight', pad_inches=0)
The results by level and by attribute are shown as well. The images run a bit long vertically, so scroll past them if you're not interested.
** By level ** Dragon words appear fairly evenly from level 1 to 12, but level 9 has noticeably more "Dragon", and level 11 seems to have no dragons at all.
** By attribute ** Warrior-flavored words such as "warrior" and "saber" stand out in the Earth attribute. It goes without saying that the Dark attribute is full of words like "demon", "dark", and "devil".
python
def make_wordclouds(df, colname):
    wordclouds = []
    df = df.sort_values(colname)
    for i in df[colname].unique():
        # word_freq = df.query("{} == {}".format(colname, i))["word"].value_counts() # convert to a pandas Series and use value_counts()
        word_freq = df[df[colname] == i]["word"].value_counts()
        monsters_word_count = pd.DataFrame({'word' : word_freq.index, 'word_count' : word_freq.tolist()})
        wordclouds.append(make_wordcloud(monsters_word_count, 'word', 'word_count'))
    f, ax = plt.subplots(len(wordclouds), 1, figsize=(20, 5*int(len(wordclouds))))
    for i, wordcloud in enumerate(wordclouds):
        ax[i].imshow(wordcloud)
        ax[i].set_title("{}:".format(colname) + str(df[colname].unique()[i]))
        ax[i].axis("off")
make_wordclouds(monsters_words, "level")
plt.savefig('./output/nlp5-2b.png', bbox_inches='tight', pad_inches=0)
make_wordclouds(monsters_words, "attr")
plt.savefig('./output/nlp5-2c.png', bbox_inches='tight', pad_inches=0)
To get at similarity between words and to prepare for the machine learning step later, we vectorize the words so that a machine can handle their meaning more easily. Converting a word into a vector of a few to a few hundred dimensions is called a **distributed representation**.
This time we use `word2vec` for the distributed representation of words: by passing in lists of words, you can easily convert each word into a vector with any number of dimensions you like. `Doc2Vec` is used for sentence-level vectorization.
For details on how Word2Vec and Doc2Vec work and how to use them, see the following links.
- Understanding Word2Vec
- Summary of Doc2Vec
6-1. Word2Vec
As preparation, reshape the data frame `monsters_words` created in the previous chapter into `monsters_wordlist`. Rows go back to being one per monster, with a new column `wordlist` holding the list of words contained in that card's name and a column `length` holding the number of words.
python
wordlist = monsters_words.groupby("name")["word"].apply(list).reset_index()
wordlist.columns = ["name", "wordlist"]
wordlist["length"] = wordlist["wordlist"].apply(len)
monsters_wordlist = pd.merge(wordlist, monsters_norank, how="left")
monsters_wordlist
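A quick look at the new `length` column (a small sanity-check sketch, not in the original article) shows how many words each card name breaks into:
python
# Sketch: distribution of the number of words per card name
print(monsters_wordlist["length"].describe())
print(monsters_wordlist["length"].value_counts().head())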
Here is the code that actually trains the model. `size` is the number of dimensions, `iter` is the number of training iterations, and `window` is how many words before and after a word are used as its context.
python
%time model_w2v = word2vec.Word2Vec(monsters_wordlist["wordlist"], size=30, iter=3000, window=3)
model_w2v
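Before looking at similar words, a quick shape check (a minimal sketch) confirms the model holds one 30-dimensional vector per vocabulary word:
python
# Sketch: (vocabulary size, vector dimensionality)
print(model_w2v.wv.vectors.shape)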
After training, let's check it quickly. The `wv.most_similar()` method shows the top n words judged closest in meaning to a given word.
When I put in "Red", the word "**black**", which also represents a color, came out on top. Looks good!
If the results don't seem right, move the parameters above around and re-check (a rough sweep is sketched after the output below).
python
model_w2v.wv.most_similar(positive="Red", topn=20)
[('black', 0.58682781457901),
('Devil', 0.5581836700439453),
('Artif', 0.5535239577293396),
('phantom', 0.4850098788738251),
('To be', 0.460792601108551),
('of', 0.4455495774745941),
('Ancient', 0.43780404329299927),
('Water', 0.4303821623325348),
('Dragon', 0.4163920283317566),
('Holy', 0.4114375710487366),
('Genesis', 0.3962644040584564),
('Sin', 0.36455491185188293),
('white', 0.3636135756969452),
('Giant', 0.3622574210166931),
('Road', 0.3602677285671234),
('Guardian', 0.35134968161582947),
('power', 0.3466736972332001),
('Elf', 0.3355366587638855),
('gear', 0.3334060609340668),
('driver', 0.33207967877388)]
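One simple way to do that re-checking is a small sweep over the main hyperparameters, comparing the nearest neighbors of a word you care about. This is a rough sketch under the same gensim version assumed above, not part of the original article:
python
# Sketch: retrain with a few hyperparameter combinations and compare neighbors of "Red"
for size in (10, 30, 50):
    for window in (2, 3, 5):
        m = word2vec.Word2Vec(monsters_wordlist["wordlist"], size=size, iter=1000, window=window)
        print(size, window, m.wv.most_similar(positive="Red", topn=3))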
Next, let's visualize this result. Since Word2Vec here turns each word into a 30-dimensional vector, we need to reduce the number of dimensions (**dimensionality reduction**) before we can plot them.
`t-SNE` is an unsupervised learning method for dimensionality reduction: it compresses data into any number of dimensions while preserving the structure of the data (in particular, which points are close to each other) as much as possible.
Since we want to plot on a scatter plot with x and y axes, we implement the step that drops the 30 dimensions down to 2.
python
#Extract 200 frequently-used words
n=200
topwords = monsters_words["word"].value_counts().head(n)
w2v_vecs = np.zeros((topwords.shape[0],30))
for i, word in enumerate(topwords.index):
    w2v_vecs[i] = model_w2v.wv[word]
# Dimensionality reduction with t-SNE: reduce 30 dimensions to 2
tsne= TSNE(n_components=2, verbose=1, n_iter=500)
tsne_w2v_vecs = tsne.fit_transform(w2v_vecs)
w2v_x = tsne_w2v_vecs[:, 0]
w2v_y = tsne_w2v_vecs[:, 1]
Each word now has a two-dimensional vector, so we draw a scatter plot with those values on the x and y axes. If the structure of the original data survives the dimensionality reduction, words plotted close together should have similar meanings. At first glance the words look randomly arranged, but some clusters do share a meaning, for example:

- Near the center left: nouns for people such as "person", "lady", and "man" are clustered.
- Near the bottom: nouns for deified or superior beings such as "master", "king", and "god" are clustered.
python
df4visual = pd.DataFrame({"word":topwords.index, "x":w2v_x, "y":w2v_y})
f, ax = plt.subplots(figsize=(20, 20))
ax = sns.regplot("x","y",data=df4visual,fit_reg=False, scatter_kws={"alpha": 0.2})
for i, text in enumerate(topwords.index):
    ax.text(df4visual.loc[i, 'x'], df4visual.loc[i, 'y'], text)
ax.axis("off")
ax.set_title("Visualization of similarity of 200 words that frequently appear in card titles")
plt.savefig('./output/nlp6-1.png', bbox_inches='tight', pad_inches=0)
6-2. Doc2Vec
While `Word2Vec` learns distributed representations of words, `Doc2Vec` can learn distributed representations of whole sentences by also feeding in, as a tag, the sentence each word belongs to. This lets us measure similarity between sentences, in our case card names.
As preparation, create the `TaggedDocument` objects used as model input. Each list of words is tagged with the card name those words make up.
python
document = [TaggedDocument(words = wordlist, tags = [monsters_wordlist.name[i]]) for i, wordlist in enumerate(monsters_wordlist.wordlist)]
document[0]
TaggedDocument(words=['A', 'BF', 'May rain', 'Sohaya'], tags=['A BF-May rainのSohaya'])
Training is almost the same as with `word2vec`. The training algorithm `dm` is set to 0 (the distributed bag-of-words variant), the number of dimensions `vector_size` to 30, and the number of iterations `epochs` to 200. By the way, one epoch means feeding every sample in the dataset to the model once.
python
%time model_d2v = Doc2Vec(documents = document, dm = 0, vector_size=30, epochs=200)
Let's run the same kind of check. The `docvecs.most_similar()` method takes a card name as input and returns the most similar card names.
When "black magician" (Dark Magician) was entered, "Dark Magician Girl" came back in 1st place. Card names that share the same words follow, so the training seems to have gone more or less as intended!
python
model_d2v.docvecs.most_similar("black magician")
[('Dark magician girl', 0.9794564843177795),
('Toon Dark Magician', 0.9433020949363708),
('Toon Dark Magician Girl', 0.9370808601379395),
('Dragon Knight Dark Magician', 0.9367024898529053),
('Dragon Knight Dark Magician Girl', 0.93293297290802),
('Black Bull Drago', 0.9305672645568848),
('Magician of Black Illusion', 0.9274455904960632),
('Astrograph magician', 0.9263750314712524),
('Chronograph magician', 0.9257084727287292),
('Disc magician', 0.9256418347358704)]
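You can also pull the pairwise similarity between two specific card names straight from the model. A small sketch using the same `docvecs` interface as above; the tag strings must match the card names exactly as they appear in your dataset:
python
# Sketch: cosine similarity between two tagged card names
print(model_d2v.docvecs.similarity("black magician", "Dark magician girl"))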
Dimensionality reduction is done the same way as for `word2vec`, and 200 cards are sampled at random for visualization.
Cards with similar names land in almost the same spot, which makes the plot a little hard to read, but you can see that card names sharing the same word scatter close to each other.
python
d2v_vecs = np.zeros((monsters_wordlist.name.shape[0],30))
for i, word in enumerate(monsters_wordlist.name):
    d2v_vecs[i] = model_d2v.docvecs[word]
tsne = TSNE(n_components=2, verbose=1, n_iter=500)
tsne_d2v_vecs = tsne.fit_transform(d2v_vecs)
d2v_x = tsne_d2v_vecs[:, 0]
d2v_y = tsne_d2v_vecs[:, 1]
monsters_vec = monsters_wordlist.copy()
monsters_vec["x"] = d2v_x
monsters_vec["y"] = d2v_y
df4visual = monsters_vec.sample(200, random_state=1).reset_index(drop=True)
f, ax = plt.subplots(figsize=(20, 20))
ax = sns.regplot("x","y",data=df4visual, fit_reg=False, scatter_kws={"alpha": 0.2})
for i, text in enumerate(df4visual.name):
    ax.text(df4visual.loc[i, 'x'], df4visual.loc[i, 'y'], text)
ax.axis("off")
ax.set_title("Visualization of similarity of 200 monsters")
plt.savefig('./output/nlp6-2a.png', bbox_inches='tight', pad_inches=0)
While we're at it, let's plot all the cards without the card-name labels. The scatter plot below shows the closeness in meaning of all card names, colored by attribute. I think it came out rather well! The round cluster at the bottom of the graph can be guessed to be a collection of cards that don't belong to any particular series; cards are scattered sparsely around it, but they appear to form small groups within the same series.
python
df4visual = monsters_vec
g = sns.lmplot("x","y",data=df4visual, fit_reg=False, hue="attr", height=10)
g.ax.set_title("Distribution of closeness of meaning of all card names")
For example, there is a cluster of Earth-attribute cards around the coordinates (x, y) = (-40, -20). Querying that region shows it is mostly the "Ancient Machine" series. Looking good!
python
monsters_vec.query("-42 <= x <= -38 & -22 <= y <= -18")["name"]
2740 Perfect Machine King
3952 Ancient lizard warrior
3953 Ancient mechanical soldier
3954 Ancient mechanical synthetic beast
3955 Ancient mechanical synthetic dragon
3956 Ancient mechanic
3957 Ancient mechanical giant
3958 Ancient Mechanical Giants-Ultimate Pound
3959 Ancient mechanical giant dragon
3960 Ancient mechanical chaos giant
3961 Ancient mechanical thermonuclear dragon
3962 Ancient mechanical hounds
3963 Ancient mechanical beast
3964 Ancient mechanical battery
3965 Ancient Machine Ultimate Giants
3966 Ancient machine box
3967 Ancient mechanical body
3968 Ancient mechanical giant
3969 Ancient mechanical flying dragon
3970 Ancient mechanical knight
3971 Ancient mechanical genie
3972 Ancient gear
3973 Ancient gear machine
3974 Ancient Mage
4036 Earth Giants Gaia Plate
4279 Giants goggles
4491 Pendulum blade torture machine
4762 Mechanical Soldier
4764 Machine dog Maron
4765 Machine King
4766 Machine King-Prototype
4767 Mechanical Dragon Power Tool
4768 Mechanical Sergeant
4994 Lava Giant
5247 Sleeping Giant Zushin
5597 Super ancient monster
Finally, let's look at similarity not per card name but per attribute, type, and level. The vector obtained for each card is averaged within each attribute, type, and level, and plotted for each cut of the data.
** By attribute **
Only the Dark attribute was mapped to a spot that feels somewhat out of place.
python
df4visual = monsters_vec.groupby("attr").mean()[["x", "y"]].reset_index().query("attr != 'God attribute'").reset_index(drop=True) # God attributeは外れ値になるため省略する
f, ax = plt.subplots(figsize=(10, 10))
ax = sns.regplot("x","y",data=df4visual, fit_reg=False, scatter_kws={"alpha": 0.2})
for i, text in enumerate(df4visual.attr):
    ax.text(df4visual.loc[i, 'x'], df4visual.loc[i, 'y'], text)
ax.set_title("Visualization of similarity of card names by attribute")
plt.savefig('./output/nlp6-2c.png', bbox_inches='tight', pad_inches=0)
** By race **
If you force an interpretation, fish and reptiles end up close together.
python
df4visual = monsters_vec.groupby("species").mean()[["x", "y"]].reset_index().query("species != 'Creative deity' & species != 'Phantom Beast'").reset_index(drop=True) #God attribute races are outliers and are omitted
f, ax = plt.subplots(figsize=(15, 15))
ax = sns.regplot("x","y",data=df4visual, fit_reg=False, scatter_kws={"alpha": 0.2})
for i, text in enumerate(df4visual.species):
    ax.text(df4visual.loc[i, 'x'], df4visual.loc[i, 'y'], text)
ax.axis("on")
ax.set_title("Visualization of similarity of card names by race")
plt.savefig('./output/nlp6-2d.png', bbox_inches='tight', pad_inches=0)
** By level **
The low levels (1-4) sit quite close together. Among the high levels, 10 and 11 are close but 12 is far away, so level-12 cards can be inferred to have distinctive naming.
python
df4visual = monsters_vec.groupby("level").mean()[["x", "y"]].reset_index().query("level != '0'").reset_index(drop=True) #Level 0 is an outlier and is omitted
f, ax = plt.subplots(figsize=(10, 10))
ax = sns.regplot("x","y",data=df4visual, fit_reg=False, scatter_kws={"alpha": 0.2})
for i, text in enumerate(df4visual.level):
    ax.text(df4visual.loc[i, 'x'], df4visual.loc[i, 'y'], text)
ax.set_title("Visualization of card name similarity by level")
plt.savefig('./output/nlp6-2e.png', bbox_inches='tight', pad_inches=0)
Thank you for reading this far. Digging further into Yu-Gi-Oh! card names, we ran through morphological analysis with MeCab, visualization with WordCloud, and distributed representations with Word2Vec and Doc2Vec.
I'm personally quite happy with the all-card scatter plot from Doc2Vec. The machine learning part that comes next will use the features obtained here as they are, so I expect we can build a reasonably accurate prediction model.
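As a rough sketch of how those features might be carried over to the next article (the column names, output path, and exact workflow are assumptions, not something fixed in the series yet):
python
# Sketch: attach the 30-dimensional Doc2Vec vectors to the monster table as feature columns
d2v_features = pd.DataFrame(
    [model_d2v.docvecs[name] for name in monsters_wordlist.name],
    columns=["d2v_{}".format(i) for i in range(30)]
)
features_for_ml = pd.concat([monsters_wordlist.reset_index(drop=True), d2v_features], axis=1)
features_for_ml.to_csv("./output/monsters_d2v_features.csv", index=False)  # hypothetical path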
Next up is finally machine learning. I haven't implemented it yet and am still settling on the theme, but I'd like to build prediction models along the following lines. Please look forward to it.

- Doc2Vec & LightGBM
- LSTM (this one may be dropped for time reasons)