Have you ever noticed how much time you spend creating features, especially when a competition has many rows or merge is just slow? In such cases, using map makes it faster.
With only a few rows the difference doesn't show, so this time I'll use Kaggle's Titanic data. It has a column called Age, which I'll count encode.
merge_map.py
import pandas as pd
import time
df = pd.read_csv("train.csv")
df["Age"] = df["Age"].astype(str)
t1 = time.time()
# pattern 1: merge
df = pd.merge(df,
              df["Age"].value_counts().rename_axis("Age").rename("Age_count1").reset_index(),
              on="Age", how="left")
t2 = time.time()
# pattern 2: map
df["Age_count2"] = df["Age"].map(df["Age"].value_counts())
t3 = time.time()
print(t2-t1)
print(t3-t2)
#output
0.004603147506713867
0.0012080669403076172
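As a sanity check, the two approaches produce identical columns. Here is a minimal self-contained sketch using a toy DataFrame in place of `train.csv` (the values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the Titanic frame (train.csv is not bundled here)
df = pd.DataFrame({"Age": ["22", "38", "22", "35", "22"]})

# merge-based count encoding (column naming written to work across pandas versions)
counts = df["Age"].value_counts().rename_axis("Age").rename("Age_count1").reset_index()
df = pd.merge(df, counts, on="Age", how="left")

# map-based count encoding
df["Age_count2"] = df["Age"].map(df["Age"].value_counts())

# Both approaches yield the same counts per row
print(df["Age_count1"].equals(df["Age_count2"]))  # → True
```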
In this case, map is about four times faster. You can also do target encoding with map.
df = pd.merge(df,
              df.groupby("Age").Survived.mean().reset_index().rename(columns={"Survived": "Age_target1"}),
              on="Age", how="left")
t4 = time.time()
df["Age_target2"] = df["Age"].map(df.groupby("Age").Survived.mean())
t5 = time.time()
print(t4-t3)
print(t5-t4)
#output
0.005101919174194336
0.001428842544555664
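The map-based target encoding also works on its own. A small sketch with made-up data, where `Survived` stands in for the target column:

```python
import pandas as pd

# Hypothetical toy frame; Survived is the target we encode Age against
df = pd.DataFrame({"Age": ["22", "38", "22", "35"],
                   "Survived": [0, 1, 1, 0]})

# map-based target encoding: mean of Survived for each Age value
df["Age_target"] = df["Age"].map(df.groupby("Age").Survived.mean())
print(df["Age_target"].tolist())  # → [0.5, 1.0, 0.5, 0.0]
```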
This is also about four times faster. At this scale the gap is only a few milliseconds, but scale the data up and it becomes the difference between 10 seconds and 40 seconds, or between 1 minute and 4 minutes.
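To see the gap at a larger scale, here is a sketch that times both approaches on synthetic data (the row count, key cardinality, and seed are arbitrary choices, not from the article; absolute timings will vary by machine):

```python
import time

import numpy as np
import pandas as pd

# Synthetic frame: one million rows with 1,000 distinct string keys
rng = np.random.default_rng(0)
df = pd.DataFrame({"key": rng.integers(0, 1000, size=1_000_000).astype(str)})

# merge-based count encoding
t0 = time.time()
counts = df["key"].value_counts().rename_axis("key").rename("count1").reset_index()
merged = pd.merge(df, counts, on="key", how="left")
t1 = time.time()

# map-based count encoding
df["count2"] = df["key"].map(df["key"].value_counts())
t2 = time.time()

print(f"merge: {t1 - t0:.3f}s, map: {t2 - t1:.3f}s")
```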
map takes a single key. When you want to count by a combination of columns, you can forcibly make one.
df = pd.merge(df,
              df.assign(sex_age_count=0).groupby(["Sex", "Age"])["sex_age_count"].count().reset_index(),
              on=["Sex", "Age"], how="left")
t6 = time.time()
#Forcibly make a key
df["Sex_Age"] = df["Sex"] + df["Age"]
t7 = time.time()
df["Sex_Age_count"] = df["Sex_Age"].map(df["Sex_Age"].value_counts())
t8 = time.time()
print(t6-t5)
print(t8-t7)
#output
0.006415843963623047
0.0015180110931396484
This is also about four times faster.
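If you'd rather not build a combined string key at all, `groupby(...).transform` is another way to broadcast a multi-column count back onto each row. A sketch with a hypothetical toy frame whose columns mirror the Titanic data:

```python
import pandas as pd

# Toy frame standing in for the Titanic columns
df = pd.DataFrame({"Sex": ["male", "female", "male", "male"],
                   "Age": ["22", "38", "22", "35"]})

# transform("count") returns the per-group count aligned to the original rows,
# so no concatenated Sex_Age key is needed
df["Sex_Age_count"] = df.groupby(["Sex", "Age"])["Age"].transform("count")
print(df["Sex_Age_count"].tolist())  # → [2, 1, 2, 1]
```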