Learning machine learning with Pokemon

Introduction

Last month, Pokemon Sword and Shield was released. By the way, have you ever played Pokemon? As anyone who has played knows, each Pokemon has six stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed. Generally, the higher a Pokemon's stats, the stronger it is. Each stat is calculated from three values: the base stat (literally "race value"), the individual value, and the effort value (the formulas are written below). The **base stat** is fixed for each species of Pokemon. The **individual value** is assigned to each individual Pokemon, which is why two Pokemon of the same species can differ in strength. The **effort value** is acquired later. The individual value is determined at birth, while the effort value can be increased through battles. This time, I would like to predict a Pokemon's type from its base stats using Python.

・ HP stat = (base stat x 2 + individual value + effort value ÷ 4) x level ÷ 100 + level + 10
・ Stat other than HP = {(base stat x 2 + individual value + effort value ÷ 4) x level ÷ 100 + 5} x nature modifier
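The two formulas can be written as small Python functions. This is only a sketch that follows the formulas exactly as written above: it applies no rounding at any step, and the function names, argument names, and the sample inputs (base stat 100, individual value 31, effort value 252, level 50, nature modifier 1.1) are made up for illustration.

```python
def hp_stat(base, iv, ev, level):
    """HP stat from the base stat, individual value, effort value and level."""
    return (base * 2 + iv + ev / 4) * level / 100 + level + 10


def other_stat(base, iv, ev, level, nature=1.0):
    """Any stat other than HP; `nature` is the nature modifier."""
    return ((base * 2 + iv + ev / 4) * level / 100 + 5) * nature


# Illustrative inputs only, not taken from the article's data.
print("HP stat:    %.1f" % hp_stat(base=100, iv=31, ev=252, level=50))
print("Other stat: %.1f" % other_stat(base=100, iv=31, ev=252, level=50, nature=1.1))
```

Only the base stat is shared by every member of a species, which is why it is the natural input for predicting the type.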

Development environment

- CPU: 1.4 GHz quad-core 8th-generation Intel Core i5
- OS: macOS
- Visual Studio Code
- Python 3.7.3 64-bit (base: conda)

What I did first

When I searched for "Pokemon machine learning", I found a site doing something similar, so I used it as a reference. https://www.hands-lab.com/tech/entry/3991.html That site tries to determine whether a Pokemon is a Water type from its base stats, so for the time being I implemented the same thing by copy and paste. I thought it was a success because the accuracy was **85.3%**, but in reality the only Pokemon judged to be Water types were Chansey and Blissey, neither of which is a Water type.

Now let's sort out the situation. There are 909 Pokemon, of which 123 are Water types and 786 are not. Consider a model that answers "not a Water type" no matter what base stats are entered. The accuracy of this model is 786/909 x 100 = **86.5%**, which is higher than the 85.3% I got.

In other words, in binary classification, accuracy alone can look good even when the model is useless, unless the two classes have about the same number of samples.
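The comparison with the "always answer not Water" model can be checked in a few lines. This is a minimal sketch in plain Python; the function name `majority_baseline_accuracy` is made up for illustration, and the counts (909 Pokemon, 123 Water types) and the 85.3% accuracy are the ones quoted above.

```python
def majority_baseline_accuracy(n_total, n_positive):
    """Accuracy of a model that always predicts the larger class."""
    n_negative = n_total - n_positive
    return max(n_positive, n_negative) / n_total


baseline = majority_baseline_accuracy(n_total=909, n_positive=123)
model_accuracy = 0.853  # accuracy of the copied Water-type model

print("baseline accuracy: %.3f" % baseline)
print("model beats baseline:", model_accuracy > baseline)
```

Checking this baseline first shows whether an accuracy figure means anything for the data at hand.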

What I did next

Based on that reflection, I chose two classes with about the same number of samples. This time, I will create a model that determines whether a Pokemon is a Steel type or an Electric type (Steel type: 58, Electric type: 60). Pokemon that have both the Electric and Steel types, like Magneton, were counted as Steel types. The Pokemon data was borrowed from here.

# %%
import codecs

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# The CSV is Shift-JIS encoded; bytes that cannot be decoded are skipped ("ignore").
with codecs.open("data/pokemon_status.csv", "r", "Shift-JIS", "ignore") as file:
    df = pd.read_csv(file)

df.info()


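# %%
# Select every Pokemon whose first or second type is Steel.
metal1 = df[df['Type 1'] == "Steel"]
metal2 = df[df['Type 2'] == "Steel"]
metal = pd.concat([metal1, metal2])
print("Steel type Pokemon: %d" % len(metal))

# Do the same for Electric. Pokemon that are both Steel and Electric
# end up in both frames, so they appear twice in pokemon_m_e below.
elec1 = df[df['Type 1'] == "Electric"]
elec2 = df[df['Type 2'] == "Electric"]
elec = pd.concat([elec1, elec2])
print("Electric type Pokemon: %d" % len(elec))


def type_to_num(p_type):
    """Return 0 for Steel and 1 for anything else (including a missing type)."""
    if p_type == "Steel":
        return 0
    else:
        return 1


pokemon_m_e = pd.concat([metal, elec], ignore_index=True)
type1 = pokemon_m_e["Type 1"].apply(type_to_num)
type2 = pokemon_m_e["Type 2"].apply(type_to_num)
# The product is 0 if either type is Steel, so dual Steel/Electric
# Pokemon are labeled Steel (0); everything else is labeled Electric (1).
pokemon_m_e["type_num"] = type1*type2
pokemon_m_e.head()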


# %%
# Columns 7 to 12 are used as the features (the six base stats).
X = pokemon_m_e.iloc[:, 7:13].values
y = pokemon_m_e["type_num"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
lr = LogisticRegression(C=1.0)
lr.fit(X_train, y_train)


# %%
print("score for train data: %.3f" % lr.score(X_train, y_train))
print("score for test data: %.3f" % lr.score(X_test, y_test))


# %%
# Predict every row (training and test data together) and tally the results.
error1 = 0    # judged Steel, actually Electric
success1 = 0  # judged Steel, actually Steel
error2 = 0    # judged Electric, actually Steel
success2 = 0  # judged Electric, actually Electric
y_pred = lr.predict(X)
print("[List of Pokemon judged to be Steel type]")
print("----------------------------------------")
print("")
for i in range(len(pokemon_m_e)):
    actual = pokemon_m_e.loc[i, "type_num"]
    if y_pred[i] == 0:
        print(pokemon_m_e.loc[i, "Pokemon name"])
        if actual == 0:
            success1 += 1
            print("Steel type")
        else:
            error1 += 1
            print("Not a Steel type")
        print("")
    elif actual == 0:
        error2 += 1
    else:
        success2 += 1
print("----------------------------------------")
print("Number of Pokemon correctly judged to be Steel type: %d" % success1)
print("Number of Pokemon correctly judged to be Electric type: %d" % success2)
print("Number of Pokemon mistakenly judged to be Steel type: %d" % error1)
print("Number of Pokemon mistakenly judged to be Electric type: %d" % error2)
print("")

Execution result

score for train data: 0.732
score for test data: 0.861
Number of Pokemon correctly judged to be Steel type: 48
Number of Pokemon correctly judged to be Electric type: 43
Number of Pokemon mistakenly judged to be Steel type: 13
Number of Pokemon mistakenly judged to be Electric type: 14

Surprisingly, most Pokemon were judged correctly (91 of 118, about 77%), so I think it was generally successful. Rotom was judged to be a Steel type (laughs).
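To look beyond a single accuracy figure, the four counts in the execution result can be turned into precision and recall for the Steel class. This is a minimal sketch in plain Python; the function name `steel_metrics` and its argument names are made up for illustration, and the numbers are copied from the execution result above.

```python
def steel_metrics(steel_ok, electric_ok, wrong_steel, wrong_electric):
    """Summarize the tallies, treating Steel as the positive class.

    steel_ok       -- judged Steel, actually Steel
    electric_ok    -- judged Electric, actually Electric
    wrong_steel    -- judged Steel, actually Electric
    wrong_electric -- judged Electric, actually Steel
    """
    total = steel_ok + electric_ok + wrong_steel + wrong_electric
    accuracy = (steel_ok + electric_ok) / total
    precision = steel_ok / (steel_ok + wrong_steel)
    recall = steel_ok / (steel_ok + wrong_electric)
    return accuracy, precision, recall


accuracy, precision, recall = steel_metrics(48, 43, 13, 14)
print("accuracy:  %.3f" % accuracy)
print("precision: %.3f" % precision)
print("recall:    %.3f" % recall)
```

Note that these counts cover every row, including the rows the model was trained on, so they are not a pure test-set measurement.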

What I did after that

The example above compared the Electric type and the Steel type. There are 18 Pokemon types in all, so I would like to find out which pair of types gives the best accuracy.

# %%
import codecs
import warnings

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hide library warnings so the output stays readable.
warnings.filterwarnings('ignore')

# The CSV is Shift-JIS encoded; bytes that cannot be decoded are skipped ("ignore").
with codecs.open("data/pokemon_status.csv", "r", "Shift-JIS", "ignore") as file:
    df = pd.read_csv(file)

df.info()


# %%
def lr_model_pokemon(type1, type2, test_size=0.3, random_state=0, C=1.0):
    """Fit a type1-vs-type2 classifier and return [train score, test score]."""
    # Pokemon whose first or second type is type1
    df_type1_1 = df[df['Type 1'] == type1]
    df_type2_1 = df[df['Type 2'] == type1]
    df_type_1 = pd.concat([df_type1_1, df_type2_1])

    # Pokemon whose first or second type is type2
    df_type1_2 = df[df['Type 1'] == type2]
    df_type2_2 = df[df['Type 2'] == type2]
    df_type_2 = pd.concat([df_type1_2, df_type2_2])

    def type_to_num(p_type):
        if p_type == type1:
            return 0
        else:
            return 1

    pokemon_concat = pd.concat([df_type_1, df_type_2], ignore_index=True)
    type_num1 = pokemon_concat["Type 1"].apply(type_to_num)
    type_num2 = pokemon_concat["Type 2"].apply(type_to_num)
    # 0 if either type is type1, otherwise 1 (treated as type2)
    pokemon_concat["type_num"] = type_num1 * type_num2

    # Columns 7 to 12 are used as the features (the six base stats).
    X = pokemon_concat.iloc[:, 7:13].values
    y = pokemon_concat["type_num"].values

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    lr = LogisticRegression(C=C)
    lr.fit(X_train, y_train)

    return [lr.score(X_train, y_train), lr.score(X_test, y_test)]


# %%
max_score_train = 0
max_score_test = 0
train_type1 = ""
test_type1 = ""
train_type2 = ""
test_type2 = ""
type_list = ["Grass", "Fire", "Water", "Bug", "Normal", "Dark", "Rock", "Steel",
             "Electric", "Ghost", "Dragon", "Psychic", "Fighting", "Poison", "Fairy", "Ground", "Flying", "Ice"]

for type1 in type_list:
    for type2 in type_list:
        if type1 == type2:
            continue
        score = lr_model_pokemon(type1=type1, type2=type2)
        if (score[0] >= max_score_train):
            max_score_train = score[0]
            train_type1 = type1
            train_type2 = type2
        if (score[1] >= max_score_test):
            max_score_test = score[1]
            test_type1 = type1
            test_type2 = type2

print("For %s and %s, the score for the training data is maximized: score = %.3f" %
      (train_type1, train_type2, max_score_train))
print("For %s and %s, the score for the test data is maximized: score = %.3f" %
      (test_type1, test_type2, max_score_test))

Execution result

For Steel and Normal, the score for the training data is maximized: score = 0.942
For Steel and Normal, the score for the test data is maximized: score = 0.962

The model that distinguishes the Steel type from the Normal type seems to have the highest accuracy. Now, let's see what kind of judgments it actually makes.

# %%
def poke_predict(type1, type2):
    """Fit a type1-vs-type2 model and list the Pokemon it judges to be type1."""
    type1_1 = df[df['Type 1'] == type1]
    type2_1 = df[df['Type 2'] == type1]
    type_1 = pd.concat([type1_1, type2_1])
    print("%s type Pokemon: %d" % (type1, len(type_1)))

    type1_2 = df[df['Type 1'] == type2]
    type2_2 = df[df['Type 2'] == type2]
    type_2 = pd.concat([type1_2, type2_2])
    print("%s type Pokemon: %d" % (type2, len(type_2)))

    def type_to_num(p_type):
        if p_type == type1:
            return 0
        else:
            return 1

    poke_concat = pd.concat([type_1, type_2], ignore_index=True)
    type1_c = poke_concat["Type 1"].apply(type_to_num)
    type2_c = poke_concat["Type 2"].apply(type_to_num)
    # 0 if either type is type1, otherwise 1 (treated as type2)
    poke_concat["type_num"] = type1_c*type2_c

    X = poke_concat.iloc[:, 7:13].values
    y = poke_concat["type_num"].values

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    lr = LogisticRegression(C=1.0)
    lr.fit(X_train, y_train)

    # Predict every row (training and test data together) and tally the results.
    error1 = 0    # judged type1, actually type2
    success1 = 0  # judged type1, actually type1
    error2 = 0    # judged type2, actually type1
    success2 = 0  # judged type2, actually type2
    y_pred = lr.predict(X)
    print("")
    print("[List of Pokemon judged to be %s type]" % type1)
    print("----------------------------------------")
    print("")
    for i in range(len(poke_concat)):
        actual = poke_concat.loc[i, "type_num"]
        if y_pred[i] == 0:
            print(poke_concat.loc[i, "Pokemon name"])
            if actual == 0:
                success1 += 1
                print("%s type" % type1)
            else:
                error1 += 1
                print("Not %s type" % type1)
            print("")
        elif actual == 0:
            error2 += 1
        else:
            success2 += 1
    print("----------------------------------------")
    print("Number of Pokemon correctly judged to be %s type: %d" % (type1, success1))
    print("Number of Pokemon correctly judged to be %s type: %d" % (type2, success2))
    print("Number of Pokemon mistakenly judged to be %s type: %d" % (type1, error1))
    print("Number of Pokemon mistakenly judged to be %s type: %d" % (type2, error2))
    print("")


# %%
poke_predict("Steel", "Normal")

Execution result

Steel type Pokemon: 58
Normal type Pokemon: 116
Number of Pokemon correctly judged to be Steel type: 50
Number of Pokemon correctly judged to be Normal type: 115
Number of Pokemon mistakenly judged to be Steel type: 1
Number of Pokemon mistakenly judged to be Normal type: 8

Although the two classes differ in size, an accuracy of 94.8% (165 of 174) is quite good: a model that always answered Normal would score only 116/174 = 66.7%. From this result, it can be said that the Normal type and the Steel type have clearly different base stat characteristics.
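Because the two classes differ in size, it can also help to look at each class separately. The following is a minimal sketch in plain Python that computes the recall of each class and their mean (often called balanced accuracy); the function name `per_class_recall` and its argument names are made up for illustration, and the counts are copied from the execution result above.

```python
def per_class_recall(first_ok, second_ok, wrong_first, wrong_second):
    """Return (recall of first class, recall of second class, their mean).

    first_ok     -- judged first type, actually first type
    second_ok    -- judged second type, actually second type
    wrong_first  -- judged first type, actually second type
    wrong_second -- judged second type, actually first type
    """
    recall_first = first_ok / (first_ok + wrong_second)
    recall_second = second_ok / (second_ok + wrong_first)
    return recall_first, recall_second, (recall_first + recall_second) / 2


recall_steel, recall_normal, balanced = per_class_recall(50, 115, 1, 8)
print("recall (Steel):    %.3f" % recall_steel)
print("recall (Normal):   %.3f" % recall_normal)
print("balanced accuracy: %.3f" % balanced)
```

The mean of the two recalls weights both types equally, so the larger Normal class cannot dominate the figure the way it can with plain accuracy.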

In closing

I'm a beginner who started learning machine learning less than a week ago, but I think I was able to dig into the topic fairly deeply. If anything in this article is wrong, I would appreciate it if you could point it out.