When processing a few GB of data with pandas, it often takes tens of minutes to hours, or even days if the code is written carelessly. Slow processing stalls everything downstream, so here are some simple source-code changes that can speed things up considerably.
index | name | height | weight |
---|---|---|---|
0 | Tanaka | 160.1 | 60.1 |
1 | Suzuki | 172.4 | 75.0 |
2 | Saitou | 155.8 | 42.2 |
... | ... | ... | ... |
999998 | Morita | 167.9 | 94.07 |
999999 | Satou | 177.7 | 80.3 |
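For reference, here is a minimal sketch that builds a DataFrame of this shape (the names and random distributions are placeholders, not the original data):

```python
import numpy as np
import pandas as pd

N = 1_000_000
# Hypothetical reconstruction of the sample data: one string column, two numeric columns
df = pd.DataFrame({
    'name': np.random.choice(['Tanaka', 'Suzuki', 'Saitou', 'Morita', 'Satou'], N),
    'height': np.random.normal(165, 8, N),
    'weight': np.random.normal(65, 12, N),
})
```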
For example, given a DataFrame like the one above, code that computes the mean height and weight is often written as follows.
Bad pattern
```python
import pandas as pd

sr_means = df.mean()
mean_height = sr_means['height']
mean_weight = sr_means['weight']
```
However, because the DataFrame contains a string column (name), the code above takes a very long time: df.mean() tries to process the non-numeric column as well.
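Which columns are non-numeric can be checked with dtypes (a quick sketch, assuming the sample DataFrame above):

```python
# Columns of dtype 'object' (here, name) are the ones that drag df.mean() down
print(df.dtypes)
# name       object
# height    float64
# weight    float64
```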
By changing the code as shown below, the processing becomes orders of magnitude faster.
Good pattern
```python
import pandas as pd

sr_means = df[['height', 'weight']].mean()
mean_height = sr_means['height']
mean_weight = sr_means['weight']
```
Postscript: Good pattern
```python
import pandas as pd

sr_means = df.mean(numeric_only=True)
mean_height = sr_means['height']
mean_weight = sr_means['weight']
```
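Another way to get the same effect is to drop the non-numeric columns up front with select_dtypes (a sketch, not from the original article):

```python
# Keep only the numeric columns, then aggregate
sr_means = df.select_dtypes(include='number').mean()
```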
Actually measure the time
Actual measurement
```python
import pandas as pd
import numpy as np

N = 100000
df_test = pd.DataFrame(
    {
        'name': ['abc'] * N,
        'weight': np.random.normal(60, 5, N),
        'height': np.random.normal(160, 5, N)
    }
)

print("df_test.mean()")
%time df_test.mean()

print("df_test[['height', 'weight']].mean()")
%time df_test[['height', 'weight']].mean()
```
The results are shown below. Even allowing for the fact that one fewer column is being averaged, the latter is roughly 750 times faster, about three orders of magnitude.
result
```
df_test.mean()
Wall time: 3.06 s
df_test[['height', 'weight']].mean()
Wall time: 4 ms
```
Next, consider using the round function to round the weight column to integers. If you are not used to Python, it is tempting to write this with a for statement; instead, use the higher-order function map. (What a higher-order function is falls outside the scope of this article.)
Bad pattern
```python
# Apply the round function to each element with a for loop
for idx in range(len(df_test['weight'].index)):
    df_test['weight'][idx] = round(df_test['weight'][idx])
```
Rewriting this with map gives the following.
Good pattern
```python
# Apply the round function to each element
df_test['weight'] = df_test['weight'].map(round)
```
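Incidentally, pandas also offers the vectorized Series.round, which skips the per-element Python function call entirely and is usually faster still (a side note, not part of the measurement below):

```python
# Vectorized rounding; Series.round returns floats, so cast if integers are required
df_test['weight'] = df_test['weight'].round().astype(int)
```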
Let's measure the time for this case as well. Because the for statement is so slow, the number of rows is reduced.
Actual measurement
```python
def func(sr):
    for idx in range(len(sr.index)):
        sr[idx] = round(sr[idx])
    return sr

N = 1000
df_test = pd.DataFrame(
    {
        'name': ['abc'] * N,
        'weight': np.random.normal(60, 5, N),
        'height': np.random.normal(160, 5, N)
    }
)

print("for loop")
%time df_test['weight'] = func(df_test['weight'])

print("map")
%time df_test['weight'] = df_test['weight'].map(round)
```
The result is below. What map processes almost instantly becomes ridiculously slow with a for statement. And this is only 1,000 rows; it is frightening to imagine handling 10 to 100 million rows this way.
result
```
for loop
Wall time: 22.1 s
map
Wall time: 0 ns
```
Just by improving these two patterns, processing that used to take a day can now finish in a few minutes. I want to keep making improvements like this.
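Putting both ideas together (restrict aggregation to the columns you need, and prefer map or vectorized operations over Python for loops), a minimal sketch of the overall flow might look like this (column names follow the example above):

```python
import numpy as np
import pandas as pd

# Assume df has a string column 'name' and numeric columns 'height' and 'weight'

# 1. Aggregate only the numeric columns that are actually needed
sr_means = df[['height', 'weight']].mean()
mean_height = sr_means['height']
mean_weight = sr_means['weight']

# 2. Transform columns with map or vectorized methods instead of row-by-row loops
df['weight'] = df['weight'].map(round)
```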