Speed comparison when shifting by group in pandas

Introduction

When working with tabular data in pandas, you sometimes want to shift rows within each category (for example, shifting time-series data by one period per user). The straightforward way to transform data by group is groupby().transform(). However, the grouping step can take a lot of time, so let's measure how long several approaches take.
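
As a minimal illustration of what "shift by one period within each group" means, here is a toy example (the column names user and value are made up for this illustration):

import pandas as pd

toy = pd.DataFrame({"user": ["A", "A", "B", "B"], "value": [1, 2, 10, 20]})
# the previous value within each user; the first row of each user becomes NaN
toy["prev_value"] = toy.groupby("user")["value"].transform(lambda s: s.shift(1))
print(toy)
#   user  value  prev_value
# 0    A      1         NaN
# 1    A      2         1.0
# 2    B     10         NaN
# 3    B     20        10.0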

Environment

- OS name: Microsoft Windows 10 Home
- Processor: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz, 2712 MHz, 2 cores, 4 logical processors
- Installed physical memory (RAM): 8.00 GB

Procedure

Data preparation

I prepared 10 million rows for this test. There are 7 variables: 5 hold arbitrary numbers, and 2 are categorical variables (Y and Z) with 10 and 4 categories, respectively.

import numpy as np
import pandas as pd


x = np.arange(10_000_000)
y = np.tile(np.arange(10), int(len(x)/10))
z = np.tile(np.arange(4), int(len(x)/4))

df = pd.DataFrame({"a": x, "b": x, "c": x, "d": x, "e": x, "Y":y, "Z": z})

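A quick sanity check of the generated frame (my addition, not part of the timed code):

print(df.shape)                        # (10000000, 7)
print(df.groupby(["Y", "Z"]).ngroups)  # 20 (Y, Z) combinations, 500,000 rows each
print(df.head())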

Experiment

In this experiment, I group by the two categorical variables.

Method 1

The straightforward approach; this serves as the baseline for comparison.

%%timeit -n 1 -r 10

s = df.groupby(["Y", "Z"])["a"].transform(lambda x: x.shift(1))
# 3.25 s ± 107 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

Method 2

This method combines the grouping variables into a single key in advance. The grouping itself becomes faster, but building the combined key takes a long time, so it only pays off when you group by the same key repeatedly.

# build the combined grouping key (this step was timed separately)
dg = df.copy()
dg["YZ"] = dg["Y"].astype("str") + dg["Z"].astype("str")
# 13.7 s ± 964 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
%%timeit -n 1 -r 10

s = dg.groupby(["YZ"])["a"].transform(lambda x: x.shift(1))
# 2.62 s ± 25.1 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
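
Since Y only takes the single-digit values 0-9 and Z only 0-3, concatenating the two strings cannot produce colliding keys here. A quick equality check against the baseline (my addition, not part of the timed runs) confirms both approaches give the same result:

baseline = df.groupby(["Y", "Z"])["a"].transform(lambda x: x.shift(1))
combined = dg.groupby(["YZ"])["a"].transform(lambda x: x.shift(1))
print(baseline.equals(combined))  # True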

Method 3

Instead of the pandas shift method, this approach defines a NumPy-based shift function and applies it. The speed does not seem to change much.

Reference: Shift elements in a NumPy array - Stack Overflow

def shift2(arr, num):
    # NumPy-based shift: vacated positions are filled with NaN
    result = np.empty(arr.shape[0])
    if num > 0:
        result[:num] = np.nan
        result[num:] = arr[:-num]
    elif num < 0:
        result[num:] = np.nan
        result[:num] = arr[-num:]
    else:
        result[:] = arr
    return result
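
A quick check of shift2 on a small array, covering both shift directions:

arr = np.arange(5, dtype=float)
print(shift2(arr, 2))   # [nan nan  0.  1.  2.]
print(shift2(arr, -2))  # [ 2.  3.  4. nan nan]
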
%%timeit -n 1 -r 10

s = df.groupby(["Y", "Z"])["a"].transform(lambda x: shift2(x, 1))
# 3.2 s ± 15.1 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

Method 4

After grouping, iterate over the groups and process each one. The speed is about the same as the other methods, but it allows more flexible per-group processing (see the example after the code below), so it is one of my favorite methods.

%%timeit -n 1 -r 10

l = [group["a"].shift(1) for _, group in df.groupby(["Y", "Z"])]    
dh = pd.concat(l, axis=0).sort_index()
# 3.12 s ± 14.4 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
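
As an example of the extra flexibility, the loop makes it easy to use a group-dependent shift amount, which is awkward to express with a single transform call (an illustration of mine, not part of the original measurements):

l = []
for (y, z), group in df.groupby(["Y", "Z"]):
    # shift each group by its own Z value instead of a fixed 1
    l.append(group["a"].shift(int(z)))
dh2 = pd.concat(l, axis=0).sort_index()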

Method 5

In fact, transform is unnecessary here: groupby().shift() can be called directly, and this is the fastest method.

%%timeit -n 1 -r 10

s = df.groupby(["Y", "Z"])["a"].shift(1)
# 983 ms ± 10.9 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
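
The result of groupby().shift() is aligned to the original index, so it can be assigned directly as a new column. A small check on the (Y, Z) = (0, 0) group (my addition):

s = df.groupby(["Y", "Z"])["a"].shift(1)
# the (0, 0) group contains a = 0, 20, 40, ...; its shifted values are NaN, 0, 20, ...
print(s.loc[df["Y"].eq(0) & df["Z"].eq(0)].head())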

Result

Method    Description        Time (per loop)
Method 1  Standard method    3.25 s ± 0.107 s
Method 2  Pre-combined key   2.62 s ± 0.0251 s
Method 3  NumPy shift        3.2 s ± 0.0151 s
Method 4  Iteration          3.12 s ± 0.0144 s
Method 5  No transform       0.983 s ± 0.0109 s

In conclusion

Don't use transform (a self-admonition).
