[Pandas speedup] If you think merge is slow, use map

merge is slow!

Don't you think so? Especially in competitions with many rows, merge can really drag, and feature creation eats up your time. In such cases, using map makes things faster.

Example

With only a few rows there is no visible difference, so this time I will use Kaggle's Titanic data. It has a column called Age, which I want to count encode.

merge_map.py


import pandas as pd
import time

df = pd.read_csv("train.csv")
df["Age"] = df["Age"].astype(str)
t1 = time.time()

# pattern 1: count encoding via merge
df = pd.merge(df, df.Age.value_counts().reset_index().rename(columns={"Age": "Age_count1"}),
              left_on="Age", right_on="index", how="left")
t2 = time.time()

# pattern 2: count encoding via map
df["Age_count2"] = df["Age"].map(df.Age.value_counts())
t3 = time.time()

print(t2-t1)
print(t3-t2)
# output
0.004603147506713867
0.0012080669403076172
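As a sanity check, both patterns produce identical counts. A minimal sketch on a toy DataFrame (hypothetical values, not the real Titanic file):

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical data)
df = pd.DataFrame({"Age": ["22", "38", "22", "35", "22"]})

# Pattern 1: count encoding via merge
counts = df["Age"].value_counts().reset_index()
counts.columns = ["Age", "Age_count1"]  # name columns explicitly; reset_index naming differs across pandas versions
merged = pd.merge(df, counts, on="Age", how="left")

# Pattern 2: count encoding via map
df["Age_count2"] = df["Age"].map(df["Age"].value_counts())

# Both columns hold the same per-value counts
print(df["Age_count2"].tolist())  # [3, 1, 3, 1, 3]
```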

In this case, map is about four times faster. Target encoding also works with map.

df = pd.merge(df, df.groupby("Age").Survived.mean().reset_index().rename(
                  columns={"Survived": "Age_target1"}), on="Age", how="left")
t4 = time.time()
df["Age_target2"] = df["Age"].map(df.groupby("Age").Survived.mean())
t5 = time.time()

print(t4-t3)
print(t5-t4)
# output
0.005101919174194336
0.001428842544555664

This is also about four times faster. At this scale the difference is negligible, but the same gap means 10 seconds versus 40 seconds, or 1 minute versus 4 minutes, once the data grows.
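A single `time.time()` pair is noisy; `timeit` averages over many runs and gives a steadier comparison. A sketch on a synthetic column (the size and seed are arbitrary assumptions):

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"Age": rng.integers(0, 80, size=100_000).astype(str)})

def by_merge():
    # count encoding via merge
    counts = df["Age"].value_counts().reset_index()
    counts.columns = ["Age", "Age_count"]
    return pd.merge(df, counts, on="Age", how="left")

def by_map():
    # count encoding via map
    return df["Age"].map(df["Age"].value_counts())

# Total time over 20 runs of each, instead of one noisy measurement
t_merge = timeit.timeit(by_merge, number=20)
t_map = timeit.timeit(by_map, number=20)
print(f"merge: {t_merge:.3f}s  map: {t_map:.3f}s")
```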

If you have two keys

map only accepts a single key. In that case, forcibly build one by concatenating the columns.

df = pd.merge(df, df.assign(sex_age_count=0).groupby(["Sex", "Age"])
                  ["sex_age_count"].count().reset_index(), on=["Sex", "Age"], how="left")
t6 = time.time()

# Forcibly make a single key
df["Sex_Age"] = df["Sex"] + df["Age"]
t7 = time.time()
df["Sex_Age_count"] = df["Sex_Age"].map(df["Sex_Age"].value_counts())
t8 = time.time()

print(t6-t5)
print(t8-t7)

This is also about four times faster.

# output
0.006415843963623047
0.0015180110931396484
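As a side note, concatenating string keys works, but `groupby(...).transform("count")` handles multiple keys directly without building a combined column. A minimal sketch on toy data (the values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Age": ["22", "38", "22", "35"],
})

# Count encode on two keys without concatenating them into one string
df["Sex_Age_count"] = df.groupby(["Sex", "Age"])["Sex"].transform("count")
print(df["Sex_Age_count"].tolist())  # [2, 1, 2, 1]
```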
