Have you ever noticed how much time you spend creating features, especially when a competition has many rows or merge is just slow? In such cases, using map makes it faster.
With only a few rows the difference doesn't show, so this time I'll use Kaggle's Titanic data. It has a column called Age, which I'll count encode.
merge_map.py
import pandas as pd
import time
df = pd.read_csv("train.csv")
df["Age"] = df["Age"].astype(str)
t1 = time.time()
# pattern 1: merge
df = pd.merge(df,
              df["Age"].value_counts().rename_axis("Age").rename("Age_count1").reset_index(),
              on="Age", how="left")
t2 = time.time()
# pattern 2: map
df["Age_count2"] = df["Age"].map(df["Age"].value_counts())
t3 = time.time()
print(t2-t1)
print(t3-t2)
#output
0.004603147506713867
0.0012080669403076172
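As a sanity check, the two approaches produce identical columns. Here is a minimal self-contained sketch using a toy DataFrame in place of `train.csv` (the values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the Titanic frame (train.csv is not bundled here)
df = pd.DataFrame({"Age": ["22", "38", "22", "35", "22"]})

# merge-based count encoding (column naming written to work across pandas versions)
counts = df["Age"].value_counts().rename_axis("Age").rename("Age_count1").reset_index()
df = pd.merge(df, counts, on="Age", how="left")

# map-based count encoding
df["Age_count2"] = df["Age"].map(df["Age"].value_counts())

# Both approaches yield the same counts per row
print(df["Age_count1"].equals(df["Age_count2"]))  # → True
```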
In this case, map is about four times faster. You can also do target encoding with map.
df = pd.merge(df,
              df.groupby("Age").Survived.mean().reset_index().rename(columns={"Survived": "Age_target1"}),
              on="Age", how="left")
t4 = time.time()
df["Age_target2"] = df["Age"].map(df.groupby("Age").Survived.mean())
t5 = time.time()
print(t4-t3)
print(t5-t4)
#output
0.005101919174194336
0.001428842544555664
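The map-based target encoding also works on its own. A small sketch with made-up data, where `Survived` stands in for the target column:

```python
import pandas as pd

# Hypothetical toy frame; Survived is the target we encode Age against
df = pd.DataFrame({"Age": ["22", "38", "22", "35"],
                   "Survived": [0, 1, 1, 0]})

# map-based target encoding: mean of Survived for each Age value
df["Age_target"] = df["Age"].map(df.groupby("Age").Survived.mean())
print(df["Age_target"].tolist())  # → [0.5, 1.0, 0.5, 0.0]
```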
This is also about four times faster. At this scale the gap is only a few milliseconds, but scale the data up and it becomes the difference between 10 seconds and 40 seconds, or between 1 minute and 4 minutes.
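To see the gap at a larger scale, here is a sketch that times both approaches on synthetic data (the row count, key cardinality, and seed are arbitrary choices, not from the article; absolute timings will vary by machine):

```python
import time

import numpy as np
import pandas as pd

# Synthetic frame: one million rows with 1,000 distinct string keys
rng = np.random.default_rng(0)
df = pd.DataFrame({"key": rng.integers(0, 1000, size=1_000_000).astype(str)})

# merge-based count encoding
t0 = time.time()
counts = df["key"].value_counts().rename_axis("key").rename("count1").reset_index()
merged = pd.merge(df, counts, on="key", how="left")
t1 = time.time()

# map-based count encoding
df["count2"] = df["key"].map(df["key"].value_counts())
t2 = time.time()

print(f"merge: {t1 - t0:.3f}s, map: {t2 - t1:.3f}s")
```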
map takes a single key. When you want to count by a combination of columns, you can forcibly make one.
df = pd.merge(df,
              df.assign(sex_age_count=0).groupby(["Sex", "Age"])["sex_age_count"].count().reset_index(),
              on=["Sex", "Age"], how="left")
t6 = time.time()
#Forcibly make a key
df["Sex_Age"] = df["Sex"] + df["Age"]
t7 = time.time()
df["Sex_Age_count"] = df["Sex_Age"].map(df["Sex_Age"].value_counts())
t8 = time.time()
print(t6-t5)
print(t8-t7)
#output
0.006415843963623047
0.0015180110931396484
This is also about four times faster.
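If you'd rather not build a combined string key at all, `groupby(...).transform` is another way to broadcast a multi-column count back onto each row. A sketch with a hypothetical toy frame whose columns mirror the Titanic data:

```python
import pandas as pd

# Toy frame standing in for the Titanic columns
df = pd.DataFrame({"Sex": ["male", "female", "male", "male"],
                   "Age": ["22", "38", "22", "35"]})

# transform("count") returns the per-group count aligned to the original rows,
# so no concatenated Sex_Age key is needed
df["Sex_Age_count"] = df.groupby(["Sex", "Age"])["Age"].transform("count")
print(df["Sex_Age_count"].tolist())  # → [2, 1, 2, 1]
```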