This is the "Yu-Gi-Oh! DS (Data Science)" series, which analyzes Yu-Gi-Oh! card data using Python. The series consists of four articles in total, and the final goal is to implement a program that predicts the attack and defense attributes of a card from its name using natural language processing and machine learning. Note that the author's knowledge of Yu-Gi-Oh! stops at around the E・HERO era. I am an amateur at both the card game and data science, so please bear with me.
No. | Article title | Keyword | |
---|---|---|---|
0 | Get card information from the Yu-Gi-Oh! database - Yugioh DS 0. Scraping edition | beautifulsoup | |
1 | Visualize Yu-Gi-Oh! card data in Python - Yugioh DS 1. EDA edition | pandas, seaborn | This article! |
2 | Process Yu-Gi-Oh! card names with natural language processing - Yugioh DS 2. NLP edition | wordcloud, word2vec, doc2vec, t-SNE | |
3 | Predict attack and defense attributes from Yu-Gi-Oh! card names - Yugioh DS 3. Machine learning edition | lightgbm etc. | |
It's been over 10 years since I last handled Yu-Gi-Oh! cards, and I have little idea what kinds of cards exist now.
In this first installment, we explore the data from every angle with **Exploratory Data Analysis (EDA)**.
The technical theme of this article is visualization with seaborn. I will try to choose an appropriate visualization method and seaborn function according to the nature of each piece of data.
If `Anaconda` is installed, it should work. Python==3.7.4 seaborn==0.10.0
The data used in this article was scraped from the Yu-Gi-Oh! OCG Card Database with hand-written code and is current as of June 2020. Different data frames are used depending on the graph, but all of them hold the following columns.
No. | Column name | Column name (Japanese) | Sample | Supplement |
---|---|---|---|---|
1 | name | Card name | Ojama Yellow | |
2 | kana | Reading of the card name | Ojama Yellow | |
3 | rarity | Rarity | normal | Because of how the data was acquired, information such as "limited" and "forbidden" is also included. |
4 | attr | Attribute | Light attribute | For non-monster cards, holds "magic" or "trap" |
5 | effect | Effect | NaN | Holds the type of magic / trap card, such as "permanent" or "equipment". NaN for monsters |
6 | level | Level | 2 | Holds "Rank 2" and so on for rank monsters |
7 | species | Race | Beast tribe | |
8 | attack | Attack | 0 | |
9 | defence | Defense | 1000 | |
10 | text | Card text | A member of the Ojama trio who is said to get in the way by any means. When all three are together, something is said to happen... | |
11 | pack | Name of the pack containing the card | EXPERT EDITION Volume 2 | |
12 | kind | Kind | - | For monster cards, holds information such as fusion and ritual |
Every feature (column) can be classified as either categorical data or numerical data. Also, when drawing data as a graph, at most three features (columns) can be expressed at one time. Drawing a graph can therefore be described as picking features out of the whole dataset and choosing an appropriate representation based on the type (categorical / numerical) of each one. In the visualization chapters that follow, the graphs are divided into six sections according to the number of features selected at one time and the type (categorical / numerical) of each feature.
No. | Number of features | Combination | seaborn function used |
---|---|---|---|
1 | 1 | Category data | sns.barplot ,sns.countplot |
2 | 2 | Numerical data x Numerical data | sns.jointplot |
3 | 2 | Category data x numerical data | sns.barplot ,sns.boxplot |
4 | 2 | Category data x category data | sns.heatmap |
5 | 3 | Category data x Category data x Numerical data | sns.catplot |
6 | 3 | Category data x numerical data x numerical data | sns.lmplot |
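The table above can be read as a lookup from feature types to plotting functions. The following is a minimal sketch of that idea, not code from the original analysis: the helper names `feature_type` and `suggest_plots` and the toy values are hypothetical, and the type check relies only on the pandas dtype, so a column like `level` that is stored as a number but treated as a category would need to be handled by hand.

```python
import pandas as pd

# Toy frame with made-up values; the column names follow the article's schema.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "attr": ["Light attribute", "Dark attribute", "Dark attribute"],
    "attack": [0, 1200, 2500],
    "defence": [1000, 800, 2100],
})

def feature_type(series):
    """Classify a column as numerical or categorical from its dtype (hypothetical helper)."""
    return "numerical" if pd.api.types.is_numeric_dtype(series) else "categorical"

# The six combinations from the table, keyed by the sorted tuple of feature types.
PLOT_CHOICES = {
    ("categorical",): ["sns.barplot", "sns.countplot"],
    ("numerical", "numerical"): ["sns.jointplot"],
    ("categorical", "numerical"): ["sns.barplot", "sns.boxplot"],
    ("categorical", "categorical"): ["sns.heatmap"],
    ("categorical", "categorical", "numerical"): ["sns.catplot"],
    ("categorical", "numerical", "numerical"): ["sns.lmplot"],
}

def suggest_plots(df, columns):
    """Return the seaborn functions the table lists for the selected columns."""
    key = tuple(sorted(feature_type(df[c]) for c in columns))
    return PLOT_CHOICES.get(key, [])

print(suggest_plots(toy, ["attr", "attack"]))
```

Combinations that the table does not cover, such as three categorical features, simply return an empty list here.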
Import the required packages.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
sns.set(font="IPAexGothic") #Japanese support for seaborn
Load the four datasets used in this article. How each dataset is obtained will be described in 0. Scraping (not yet published as of June 2020).
all_data = pd.read_csv("./input/all_data.csv") #Data set for all cards (cards with the same name have duplicate recording packs)
print("all_data: {}rows".format(all_data.shape[0]))
cardlist = pd.read_csv("./input/cardlist.csv") #All card dataset (no duplication)
print("cardlist: {}rows".format(cardlist.shape[0]))
monsters = pd.read_csv("./input/monsters.csv") #Monster card only
print("monsters: {}rows".format(monsters.shape[0]))
monsters_norank = pd.read_csv("./input/monsters_norank.csv") #Remove rank monsters from monster cards
print("monsters_norank: {}rows".format(monsters_norank.shape[0]))
all_data: 21796rows
cardlist: 10410rows
monsters: 6913rows
monsters_norank: 6206rows
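Since the scraping article is not available, the exact way the three smaller datasets were produced is unknown. The sketch below shows one plausible derivation from `all_data`, based only on the code comments and the column descriptions above. Everything in it is an assumption: the toy rows, the `_toy` variable names, and the filtering rules (dropping duplicate names, excluding "magic" / "trap" in `attr`, excluding levels that start with "Rank").

```python
import pandas as pd

# Made-up rows that mimic the article's schema (only the columns needed here).
all_data_toy = pd.DataFrame({
    "name": ["Cyclone", "Cyclone", "Ojama Yellow", "Rank Monster X", "Some Trap"],
    "attr": ["magic", "magic", "Light attribute", "Dark attribute", "trap"],
    "level": [None, None, "2", "Rank 4", None],
    "pack": ["Pack A", "Pack B", "Pack A", "Pack C", "Pack B"],
})

# Assumption: cardlist keeps one row per card name.
cardlist_toy = all_data_toy.drop_duplicates(subset="name")

# Assumption: monsters are the cards whose attr is neither "magic" nor "trap".
monsters_toy = cardlist_toy[~cardlist_toy["attr"].isin(["magic", "trap"])]

# Assumption: rank monsters are those whose level string starts with "Rank".
is_rank = monsters_toy["level"].str.startswith("Rank", na=False)
monsters_norank_toy = monsters_toy[~is_rank]

print(len(all_data_toy), len(cardlist_toy), len(monsters_toy), len(monsters_norank_toy))
```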
Here we select and visualize categorical data only. Basically, the horizontal axis is the category and the vertical axis is the count.
In seaborn, this can be drawn with `sns.barplot` or `sns.countplot`.
For all cards in `all_data`, the number of times each card has been included in a pack is shown for the top 50. First place is "Cyclone", which has been included 45 times. It seems to be included often in starter kits (sets that let you duel right away once you buy them).
eda3-1-1
#Recording frequency ranking
# Count rows per card name in all_data (the original referred to an undefined `df`)
df4visual = all_data.groupby("name").count().sort_values(by="kana", ascending=False).head(50)
f, ax = plt.subplots(figsize=(20, 10))
ax = sns.barplot(data=df4visual, x=df4visual.index, y="kana")
ax.set_ylabel("frequency")
ax.set_title("Recording frequency ranking")
for i, patch in enumerate(ax.patches):
    ax.text(i, patch.get_height()/2, int(patch.get_height()), ha='center')
plt.xticks(rotation=90);
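The ranking above counts rows per card name through `groupby(...).count()` and reads the result from the `kana` column. As a minimal sketch on made-up rows, the same ranking can be obtained with `value_counts`, which avoids depending on a second column. The toy data and variable names are hypothetical.

```python
import pandas as pd

# Made-up rows: one row per (card, pack) pair, as in all_data.
toy = pd.DataFrame({
    "name": ["Cyclone", "Cyclone", "Cyclone", "Ojama Yellow", "Ojama Yellow", "Osiris Sky Dragon"],
    "kana": ["saikuron"] * 3 + ["ojama iero"] * 2 + ["oshirisu"],
    "pack": ["A", "B", "C", "A", "D", "E"],
})

# Approach used in the article: count non-null kana values per name.
by_groupby = toy.groupby("name")["kana"].count().sort_values(ascending=False).head(50)

# Alternative: count the names directly; the result is already sorted in descending order.
by_value_counts = toy["name"].value_counts().head(50)

print(by_value_counts)
```

The two results agree as long as `kana` has no missing values, because `count()` only counts non-null entries.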
In the breakdown by attribute, magic cards come first with 1946 cards. Among monster cards, the Dark attribute is in first place, with more than three times as many cards as the sixth-place Fire attribute. I wondered why there are six God attribute cards, but it turns out there are now Phoenix and Spherical versions of Ra's Wing God Dragon. I didn't know.
eda3-1-2
df4visual = cardlist
f, ax = plt.subplots(figsize=(20, 10))
ax = sns.countplot(data=df4visual, x=df4visual.attr, order=df4visual['attr'].value_counts().index)
for i, patch in enumerate(ax.patches):
    ax.text(i, patch.get_height()/2, int(patch.get_height()), ha='center')
ax.set_ylabel("frequency")
ax.set_title("Number of sheets (by attribute)");
plt.savefig('./output/eda3-1-2.png', bbox_inches='tight', pad_inches=0)
print(cardlist.query("attr == 'God attribute'")["name"])
3471 Horakuti, the Creator of Light
4437 Ra's Wing God Dragon-Phoenix
5998 Ra's Wing God Dragon-Spherical
6677 Obelisk God Warriors
8747 Osiris Sky Dragon
9136 Ra's Wing God Dragon
Name: name, dtype: object
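Both bar charts above attach a value label to each bar by looping over `ax.patches`. A minimal sketch of that step as a reusable helper follows. The helper name `label_bars` and the counts are hypothetical, and plain matplotlib bars are used so that the example does not depend on a particular seaborn version.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt

def label_bars(ax):
    """Write each bar's height at the middle of the bar (hypothetical helper)."""
    for patch in ax.patches:
        height = patch.get_height()
        x_center = patch.get_x() + patch.get_width() / 2
        ax.text(x_center, height / 2, str(int(height)), ha="center", va="center")

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(["magic", "Dark attribute", "trap"], [5, 3, 2])  # made-up counts
label_bars(ax)
ax.set_ylabel("frequency")
```

Using the patch's own x position instead of the loop index keeps the labels centered even if the bar width or spacing changes.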
Level can be interpreted as either numerical or categorical data; here we treat it as an ordinal scale (the order is meaningful, but the intervals are not). Rank monsters are excluded, and the number of cards per level is displayed. I had a vague intuition that odd-level cards are fewer than even-level cards, and indeed the counts always satisfy 1 < 2, 3 < 4, and so on.
eda3-1-3
df4visual = monsters_norank
# The rest is omitted because it is almost the same as 3-1-2.
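The odd-versus-even observation can also be checked numerically instead of by eye. The sketch below does this on made-up level values. With the real data one would presumably start from `monsters_norank["level"]`, which is an assumption about how that column is stored.

```python
import pandas as pd

# Made-up levels; not the real distribution.
levels = pd.Series([1, 2, 2, 3, 4, 4, 4, 5, 6, 6])

# Number of cards per level, ordered by level.
counts = levels.value_counts().sort_index()

# For each odd level, check whether it has fewer cards than the next even level.
pairs = {
    (odd, odd + 1): bool(counts.get(odd, 0) < counts.get(odd + 1, 0))
    for odd in range(1, int(counts.index.max()), 2)
}

print(pairs)
```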
There are quite a few rarities I don't recognize (Millennium, etc.). Note that `all_data` is used as the denominator, because the same card can have a different rarity in each pack. Forbidden / limited entries may include duplicates.
eda3-1-4
df4visual = all_data
# The rest is omitted because it is almost the same as 3-1-2.
I had the impression that wizards and dragons were the most common, but surprisingly there are more warriors and machines. Is that because they have many series (such as the nostalgic "E・HERO")?
eda3-1-5
df4visual = monsters
# The rest is omitted because it is almost the same as 3-1-2.
This is the kind of monster card (fusion, ritual, etc.); there are many terms here that I don't know.
eda3-1-6
df4visual = monsters
# The rest is omitted because it is almost the same as 3-1-2.
In most cases, numerical values are placed on both the x-axis and the y-axis and shown as a scatter plot.
With seaborn, you can draw scatter plots using `sns.jointplot`, `sns.regplot`, and `sns.lmplot`. Their usage differs slightly; please refer to the official documentation for details.
For each monster card, a scatter plot is drawn with attack on the x-axis and defense on the y-axis.
You can see that most cards are crowded into the range below 3000. Also, if you split the plot along the line y = x, the lower right is darker, so most cards seem to have attack > defense.
eda3-2-1
df4visual = monsters
g = sns.jointplot(data=df4visual, x="attack", y="defence", height=10, alpha=0.3)
plt.subplots_adjust(top=0.9)
plt.suptitle('Distribution of offensive power x defensive power')
plt.savefig('./output/eda3-2-1.png', bbox_inches='tight', pad_inches=0)
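The impression that most cards lie below the line y = x (attack greater than defense) can be quantified directly. The sketch below uses made-up values; with the real data the same comparison would presumably be applied to `monsters`, which is an assumption.

```python
import pandas as pd

# Made-up attack / defence values.
toy = pd.DataFrame({
    "attack": [1800, 0, 2500, 1000],
    "defence": [1000, 1000, 2100, 1000],
})

# The mean of a boolean column is the share of rows where the condition holds.
share_attack_higher = (toy["attack"] > toy["defence"]).mean()
share_equal = (toy["attack"] == toy["defence"]).mean()
share_defence_higher = (toy["attack"] < toy["defence"]).mean()

print(share_attack_higher, share_equal, share_defence_higher)
```

Rows with a missing attack or defence value evaluate to False in every comparison, so the three shares can add up to less than 1 on real data.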
A typical pattern puts each category on the x-axis and either the numerical value for each category or an aggregate (total, mean, maximum, ...) on the y-axis.
Treating the card name as a category, attack and defense are displayed in descending order with `sns.barplot`.
It seems there are still no cards whose original attack or defense exceeds 5000.
eda3-3-1
df4visual = monsters
df4visual_atk = df4visual.sort_values("attack", ascending=False).head(50)
df4visual_def = df4visual.sort_values("defence", ascending=False).head(50)
f, ax = plt.subplots(2, 1, figsize = (20, 15), gridspec_kw=dict(hspace=0.8))
f.subplots_adjust(hspace=2.0)
# x and y are passed as keywords (positional use is deprecated since seaborn 0.11)
ax[0] = sns.barplot(x="name", y="attack", data=df4visual_atk, ax=ax[0])
ax[0].tick_params(axis='x', labelrotation=90, labelsize = 9)
ax[0].set_xlabel("");
ax[1] = sns.barplot(x="name", y="defence", data=df4visual_def, ax=ax[1])
ax[1].tick_params(axis='x', labelrotation=90, labelsize = 9)
plt.suptitle('Offensive / defensive power ranking')
plt.savefig('./output/eda3-3-1.png', bbox_inches='tight', pad_inches=0)
The attack and defense of each attribute are shown as box plots. A box plot shows five summary statistics (minimum, first quartile, median, third quartile, maximum), and the horizontal lines of each box correspond to these statistics. For attack, the median and third quartile of the Light attribute are higher than the others, so the Light attribute has many monsters with relatively high attack. The same holds for defense, so the Light attribute looks strongest when judged on attack and defense alone.
eda3-3-2
df4visual = monsters
f, ax = plt.subplots(2, 1, figsize = (20, 10))
# x and y are passed as keywords (positional use is deprecated since seaborn 0.11)
ax[0] = sns.boxplot(x="attr", y="attack", data=df4visual, ax=ax[0])
ax[1] = sns.boxplot(x="attr", y="defence", data=df4visual, ax=ax[1])
ax[0].set_xticks([])
ax[0].set_xlabel("")
ax[0].set_title("Attack power distribution (by attribute)")
ax[1].set_title("Defensive power distribution (by attribute)");
plt.savefig('./output/eda3-3-2.png', bbox_inches='tight', pad_inches=0)
monsters.groupby("attr").describe()[['attack', 'defence']]
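The five statistics a box plot is built from can also be computed explicitly per attribute. The sketch below does so with `quantile` on made-up values; the toy frame and the column labels given to the result are hypothetical. One caveat: by default seaborn draws the whiskers only out to 1.5 times the interquartile range and plots anything beyond that as individual outlier points, so the whisker ends are not always the true minimum and maximum.

```python
import pandas as pd

# Made-up attack values for two attributes.
toy = pd.DataFrame({
    "attr": ["Light attribute"] * 3 + ["Dark attribute"] * 3,
    "attack": [1000, 2000, 3000, 0, 1000, 2000],
})

# Minimum, quartiles and maximum per attribute, one row per attribute.
summary = toy.groupby("attr")["attack"].quantile([0, 0.25, 0.5, 0.75, 1]).unstack()
summary.columns = ["min", "q1", "median", "q3", "max"]

print(summary)
```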
Level and the median of each value show a clean positive correlation. Note also that, especially at level 1, several monsters appear as outliers with attack or defense of 2000 or more.
eda3-3-3
df4visual = monsters_norank
# The rest is omitted because it is almost the same as 3-3-2.
Interpretation is omitted.
eda3-3-4
df4visual = monsters
# The rest is omitted because it is almost the same as 3-3-2.
Take categories on both the x-axis and the y-axis, and look at a summary statistic (total, mean, etc.) of the data that belongs to both categories.
In this analysis, we use `sns.heatmap` to show the number of cards belonging to both categories as a heatmap.
The combination with the most cards is Dark x Demon. By race alone, Warrior and Machine have more cards, but in combination Dark Demons come out on top. Setting aside the rare attributes and races, combinations such as Fire x Fish and Fire x Thunder do not seem to exist yet.
eda3-4-1
df4visual = pd.pivot_table(monsters, index="species", columns="attr", aggfunc="count", values='name').fillna(0).astype("int")
f, ax = plt.subplots(figsize = (20, 10))
ax = sns.heatmap(data=df4visual, cmap="YlGnBu", annot=True, fmt="d")
ax.set_title("Number of sheets (attribute x race)")
plt.savefig('./output/eda3-4-1.png', bbox_inches='tight', pad_inches=0)
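The count table that feeds the heatmap is built with `pd.pivot_table`. As a minimal sketch on made-up rows, `pd.crosstab` produces the same table of counts without the `fillna(0).astype("int")` step. The toy data is hypothetical.

```python
import pandas as pd

# Made-up monsters.
toy = pd.DataFrame({
    "name": ["m1", "m2", "m3", "m4", "m5"],
    "species": ["Demon", "Demon", "Warrior", "Warrior", "Dragon"],
    "attr": ["Dark attribute", "Dark attribute", "Earth attribute", "Dark attribute", "Light attribute"],
})

# Approach used in the article: count names per (species, attr) cell.
by_pivot = pd.pivot_table(
    toy, index="species", columns="attr", aggfunc="count", values="name"
).fillna(0).astype("int")

# Alternative: crosstab counts co-occurrences directly and fills empty cells with 0.
by_crosstab = pd.crosstab(toy["species"], toy["attr"])

print(by_crosstab)
```

Either table can be passed to `sns.heatmap` in the same way.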
It seems that there are many cards with unusual summoning methods for light and dark attributes.
eda3-4-2
df4visual = pd.pivot_table(monsters.query("kind != '-'"), index="kind", columns="attr", aggfunc="count", values='name').fillna(0).astype("int")
# The rest is omitted because it is almost the same as 3-4-1.
To show the distribution of a numerical value, we group it using two categorical variables.
There are two ways to split the groups: (1) the x-axis and (2) color.
`sns.catplot` is useful for showing the relationship between numerical data and two or more categorical variables. The numerical values can be segmented by categorical data in several ways, such as color coding, division along an axis, and division into separate panels.
The following data is mapped to the color, x-axis, and y-axis of the graph. The horizontal spread of the points within each x-axis category carries no information.

- Color: level (categorical data)
- x-axis: attribute (categorical data)
- y-axis: attack or defense (numerical data)

Because the level colors form a clean gradient, you can see that level correlates positively with attack and defense even within each attribute.
eda3-5-1
df4visual = monsters_norank
g1 = sns.catplot(x="attr", y="attack", data=df4visual, aspect=3, hue="level")
g1.ax.set_title("Attack power distribution (by attribute / level)")
plt.savefig('./output/eda3-5-1a.png', bbox_inches='tight', pad_inches=0)
g2 = sns.catplot(x="attr", y="defence", data=df4visual, aspect=3, hue="level")
g2.ax.set_title("Defensive power distribution (by attribute / level)")
plt.savefig('./output/eda3-5-1b.png', bbox_inches='tight', pad_inches=0)
Swapping the roles of the two categories (color and x-axis) gives another view. It reveals things that the box plot in `eda3-3-3` did not show: for example, level-1 monsters with attack of 2000 or more are mostly Dark and Wind attribute, and level 11 has very few cards in the first place.
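The code for the swapped version is not included in the article. The following is a minimal sketch of what it might look like, assuming it mirrors eda3-5-1 with `x` and `hue` exchanged. The toy rows are made up, and the title string is a guess.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up monsters; with the real data this would presumably be monsters_norank.
toy = pd.DataFrame({
    "level": [1, 1, 4, 4, 8, 8],
    "attr": ["Dark attribute", "Wind attribute", "Dark attribute",
             "Light attribute", "Light attribute", "Dark attribute"],
    "attack": [2000, 2100, 1500, 1800, 3000, 2800],
})

# Level on the x-axis, attribute as the color: the reverse of eda3-5-1.
g = sns.catplot(x="level", y="attack", hue="attr", data=toy, aspect=3)
g.ax.set_title("Attack power distribution (by level / attribute)")
```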
The data used is as follows. A single graph conveys a lot of information, such as which attributes each race is biased toward (eda3-4-1), the number of cards by race (eda3-1-5), and the distribution of attack and defense by race (eda3-3-4).

- Color: attribute (categorical data)
- x-axis: race (categorical data)
- y-axis: attack or defense (numerical data)

eda3-5-2
df4visual = monsters
g1 = sns.catplot(x="species", y="attack", data=df4visual, aspect=4, hue="attr")
g1.ax.set_title("Attack power distribution (by race / attribute)")
g1.ax.tick_params(axis='x', labelrotation=90)
plt.savefig('./output/eda3-5-2a.png', bbox_inches='tight', pad_inches=0)
g2 = sns.catplot(x="species", y="defence", data=df4visual, aspect=4, hue="attr")
g2.ax.set_title("Defensive power distribution (by race / attribute)")
g2.ax.tick_params(axis='x', labelrotation=90)
plt.savefig('./output/eda3-5-2b.png', bbox_inches='tight', pad_inches=0)
For numerical x numerical data in 3-2, we used a scatter plot, which takes numerical values on both the x-axis and the y-axis. Here we add more information by coloring the points of the scatter plot by category.
The scatter plot of attack and defense (eda3-2-1) is colored by level. You can see that the higher the level, the higher the attack and defense.

- Color: level (categorical data)
- x-axis: attack (numerical data)
- y-axis: defense (numerical data)

eda3-6-1a
df4visual = monsters_norank
# x and y are passed as keywords (positional use is deprecated since seaborn 0.11)
g = sns.lmplot(x="attack", y="defence", data=df4visual, fit_reg=False, hue="level", height=10)
g.ax.set_title("Offensive / defensive power distribution (by level)")
plt.savefig('./output/eda3-6-1a.png', bbox_inches='tight', pad_inches=0)
In addition, although level is categorical data, it takes discrete numeric values, so it can also be treated as numerical data to draw numerical x numerical x numerical data.
Use `mplot3d` to draw a 3D graph with three axes.
However, 3D graphs are hard to read unless they are interactive and can be rotated.
eda3-6-1b
from mpl_toolkits import mplot3d
df4visual = monsters_norank
f = plt.figure(figsize = (20, 10))
ax = plt.axes(projection = '3d')
ax.scatter3D(df4visual.attack, df4visual.defence, df4visual.level, c=df4visual.level)
plt.gca().invert_xaxis()
# Call the setters; assigning with "=" would overwrite the methods instead of setting labels
ax.set_xlabel('attack')
ax.set_ylabel('defence')
ax.set_zlabel('level')
plt.suptitle("Offensive power / defensive power / level distribution")
plt.savefig('./output/eda3-6-1b.png', bbox_inches='tight', pad_inches=0)
Thank you for reading this far. Using Yu-Gi-Oh! card data, I wrote about how to think when making graphs and how to use seaborn.
The more features you select at one time, the more information you can convey in a single figure, but the harder it becomes to narrow down the message of that graph. I want to keep feature selection simple and aim for visualizations that express only the one message I want to convey (come to think of it, getting the message right is probably something to keep in mind when writing prose, too...).
Also, although I didn't mention it for each graph, I included various useful seaborn tips in the code. Things that are easy in Excel or Tableau, such as attaching a value label to each bar or changing an axis name, can be surprisingly troublesome. I hope you find them helpful.
See also this article → Don't give up on seaborn's fine look adjustments
Next time, I plan to analyze the card names, which received little attention this time, with natural language processing as the theme. I would like to perform morphological analysis with MeCab, visualize with WordCloud, and calculate similarity with Word2Vec / Doc2Vec.