That's why I quit pandas [Data Science 100 Knock (Structured Data Processing) # 1]

We will solve the Python problem of Data Science 100 Knock (Structured Data Processing). This group of questions uses pandas for data processing in the model answer, but we will process it using NumPy's structured array after studying.

: arrow_forward: Next article (# 2)

Introduction

Many people who do data science in Python may be pandas lovers, but in fact ** you can do the same with NumPy without using pandas **. And NumPy is usually faster. As a person who loves pandas, I'm still not used to operating NumPy, so I'd like to try to graduate from pandas by operating this "Data Science 100 Knock" with NumPy this time.

This time I will do up to the 8th question. It seems that only receipt.csv is used this time. The initial data was read as follows (data type specification is postponed for the time being).

import numpy as np
import pandas as pd

#For model answer
df_receipt = pd.read_csv('data/receipt.csv')

#Data we handle
arr_receipt = np.genfromtxt(
    'data/receipt.csv', delimiter=',', encoding='utf-8',
    names=True, dtype=None)

It is memory saving.

import sys

sys.getsizeof(df_receipt)
# 26065721
sys.getsizeof(arr_receipt)
# 15074160

P_001

P-001: Display the first 10 items of all items from the data frame (df_receipt) of the receipt details, and visually check what kind of data you have.

Take it in slices.

`In[001]`


arr_receipt[:10]

You can get the data as follows. Moreover, it is smart enough to align the commas (unless there is a double-byte character string).

`Out[001]`


array([(20181103, 1257206400, 'S14006',  112, 1, 'CS006214000001', 'P070305012', 1, 158),
       (20181118, 1258502400, 'S13008', 1132, 2, 'CS008415000097', 'P070701017', 1,  81),
       (20170712, 1215820800, 'S14028', 1102, 1, 'CS028414000014', 'P060101005', 1, 170),
       (20190205, 1265328000, 'S14042', 1132, 1, 'ZZ000000000000', 'P050301001', 1,  25),
       (20180821, 1250812800, 'S14025', 1102, 2, 'CS025415000050', 'P060102007', 1,  90),
       (20190605, 1275696000, 'S13003', 1112, 1, 'CS003515000195', 'P050102002', 1, 138),
       (20181205, 1259971200, 'S14024', 1102, 2, 'CS024514000042', 'P080101005', 1,  30),
       (20190922, 1285113600, 'S14040', 1102, 1, 'CS040415000178', 'P070501004', 1, 128),
       (20170504, 1209859200, 'S13020', 1112, 2, 'ZZ000000000000', 'P071302010', 1, 770),
       (20191010, 1286668800, 'S14027', 1102, 1, 'CS027514000015', 'P071101003', 1, 680)],
      dtype=[('sales_ymd', '<i4'), ('sales_epoch', '<i4'), ('store_cd', '<U6'), ('receipt_no', '<i4'), ('receipt_sub_no', '<i4'), ('customer_id', '<U14'), ('product_cd', '<U10'), ('quantity', '<i4'), ('amount', '<i4')])

Let's compare the speed with the model answer (pandas).

`Time[001]`


#Model answer
%timeit df_receipt.head(10)
# 130 µs ± 5.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit arr_receipt[:10]
# 244 ns ± 8.23 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

I was able to get it at 1/500 of the speed of the model answer.

P_002

P-002: Specify columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount) from the receipt statement data frame (df_receipt), and display 10 items.

NumPy structured arrays can be manipulated in the same way as the pd.DataFrame we are used to.

`In[002]`


arr_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']][:10]

`Out[002]`


array([(20181103, 'CS006214000001', 'P070305012', 158),
       (20181118, 'CS008415000097', 'P070701017',  81),
       (20170712, 'CS028414000014', 'P060101005', 170),
       (20190205, 'ZZ000000000000', 'P050301001',  25),
       (20180821, 'CS025415000050', 'P060102007',  90),
       (20190605, 'CS003515000195', 'P050102002', 138),
       (20181205, 'CS024514000042', 'P080101005',  30),
       (20190922, 'CS040415000178', 'P070501004', 128),
       (20170504, 'ZZ000000000000', 'P071302010', 770),
       (20191010, 'CS027514000015', 'P071101003', 680)],
      dtype={'names':['sales_ymd','customer_id','product_cd','amount'], 'formats':['<i4','<U14','<U10','<i4'], 'offsets':[0,40,96,140], 'itemsize':144})

Faster than manipulating pd.DataFrame.

`Time[002]`


%timeit df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']].head(10)
# 5.19 ms ± 43.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit arr_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']][:10]
# 906 ns ± 17.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

P_003

P-003: Specify columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount) from the receipt statement data frame (df_receipt), and display 10 items. However, sales_ymd should be extracted while changing the item name to sales_date.

Renaming is a simple and annoying thing. It's easiest to go through np.lib.recfunctions.rename_fields (), but this function is still quite awkward to use.

`In[003]`


np.lib.recfunctions.rename_fields(arr_receipt, {'sales_ymd': 'sales_date'})[
    ['sales_date', 'customer_id', 'product_cd', 'amount']][:10]

`Out[003]`


array([(20181103, 'CS006214000001', 'P070305012', 158),
       (20181118, 'CS008415000097', 'P070701017',  81),
       (20170712, 'CS028414000014', 'P060101005', 170),
       (20190205, 'ZZ000000000000', 'P050301001',  25),
       (20180821, 'CS025415000050', 'P060102007',  90),
       (20190605, 'CS003515000195', 'P050102002', 138),
       (20181205, 'CS024514000042', 'P080101005',  30),
       (20190922, 'CS040415000178', 'P070501004', 128),
       (20170504, 'ZZ000000000000', 'P071302010', 770),
       (20191010, 'CS027514000015', 'P071101003', 680)],
      dtype={'names':['sales_date','customer_id','product_cd','amount'], 'formats':['<i4','<U14','<U10','<i4'], 'offsets':[0,40,96,140], 'itemsize':144})

However, it processes very fast compared to pandas.

`Time[003]`


%timeit df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']].rename(columns={'sales_ymd': 'sales_date'}).head(10)
# 12.8 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.lib.recfunctions.rename_fields(arr_receipt, {'sales_ymd': 'sales_date'})[['sales_date', 'customer_id', 'product_cd', 'amount']][:10]
# 6.19 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Pandas is slow and heavy because it creates new dataframe objects every time it processes.

P_004

P-004: Data that meets the following conditions by specifying columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount) from the receipt statement data frame (df_receipt). Extract.

--Customer ID (customer_id) is "CS018205000001"

This can also be obtained intuitively.

`In[004]`


arr_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']][
    arr_receipt['customer_id'] == 'CS018205000001']

`Out[004]`


arr_receipt[['sales_ymd','customer_id','product_cd','amount']][arr_receipt['customer_id'] == 'CS018205000001']
array([(20180911, 'CS018205000001', 'P071401012', 2200),
       (20180414, 'CS018205000001', 'P060104007',  600),
       (20170614, 'CS018205000001', 'P050206001',  990),
       (20170614, 'CS018205000001', 'P060702015',  108),
       (20190216, 'CS018205000001', 'P071005024',  102),
       (20180414, 'CS018205000001', 'P071101002',  278),
       (20190226, 'CS018205000001', 'P070902035',  168),
       (20190924, 'CS018205000001', 'P060805001',  495),
       (20190226, 'CS018205000001', 'P071401020', 2200),
       (20180911, 'CS018205000001', 'P071401005', 1100),
       (20190216, 'CS018205000001', 'P040101002',  218),
       (20190924, 'CS018205000001', 'P091503001',  280)],
      dtype={'names':['sales_ymd','customer_id','product_cd','amount'], 'formats':['<i4','<U14','<U10','<i4'], 'offsets':[0,40,96,140], 'itemsize':144})

The model answer seems to love pd.DataFrame.query (). I think it should be indexed normally.

`Time[004]`


%timeit df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']].query('customer_id == "CS018205000001"')
# 11.6 ms ± 477 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df_receipt.loc[df_receipt['customer_id'] == 'CS018205000001', ['sales_ymd', 'customer_id', 'product_cd', 'amount']]
# 9.49 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit arr_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']][arr_receipt['customer_id'] == 'CS018205000001']
# 2.7 ms ± 475 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

P_005

P-005: Data that meets the following conditions by specifying columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount) from the receipt statement data frame (df_receipt). Extract.

--Customer ID (customer_id) is "CS018205000001" --Sales amount (amount) is 1,000 or more

It is the same even if there are multiple conditions.

`In[005]`


arr_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']][
    (arr_receipt['customer_id'] == 'CS018205000001') & (arr_receipt['amount'] >= 1000)]

`Out[005]`


array([(20180911, 'CS018205000001', 'P071401012', 2200),
       (20190226, 'CS018205000001', 'P071401020', 2200),
       (20180911, 'CS018205000001', 'P071401005', 1100)],
      dtype={'names':['sales_ymd','customer_id','product_cd','amount'], 'formats':['<i4','<U14','<U10','<i4'], 'offsets':[0,40,96,140], 'itemsize':144})

P_006

P-006: Specify the columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), sales quantity (quantity), sales amount (amount) from the receipt detail data frame "df_receipt", and then Extract data that meets the conditions of.

--Customer ID (customer_id) is "CS018205000001" --Sales amount (amount) is 1,000 or more or sales quantity (quantity) is 5 or more

Somehow the line is getting longer, so I will divide it.

`In[006]`


col_list = ['sales_ymd', 'customer_id', 'product_cd', 'quantity', 'amount']
cond = ((arr_receipt['customer_id'] == 'CS018205000001')
        & ((arr_receipt['amount'] >= 1000) | (arr_receipt['quantity'] >= 5)))

arr_receipt[col_list][cond]

`Out[006]`


array([(20180911, 'CS018205000001', 'P071401012', 2200),
       (20180414, 'CS018205000001', 'P060104007',  600),
       (20170614, 'CS018205000001', 'P050206001',  990),
       (20190226, 'CS018205000001', 'P071401020', 2200),
       (20180911, 'CS018205000001', 'P071401005', 1100)],
      dtype={'names':['sales_ymd','customer_id','product_cd','amount'], 'formats':['<i4','<U14','<U10','<i4'], 'offsets':[0,40,96,140], 'itemsize':144})

P_007

P-007: Data that meets the following conditions by specifying columns in the order of receipt details data frame (df_receipt), sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount). Extract.

--Customer ID (customer_id) is "CS018205000001" --Sales amount (amount) is 1,000 or more and 2,000 or less

`In[007]`


col_list = ['sales_ymd', 'customer_id', 'product_cd', 'quantity', 'amount']
cond = ((arr_receipt['customer_id'] == 'CS018205000001')
        & ((1000 <= arr_receipt['amount']) & (arr_receipt['amount'] <= 2000)))

arr_receipt[col_list][cond]

`Out[007]`


array([(20180911, 'CS018205000001', 'P071401005', 1, 1100)],
      dtype={'names':['sales_ymd','customer_id','product_cd','quantity','amount'], 'formats':['<i4','<U14','<U10','<i4','<i4'], 'offsets':[0,40,96,136,140], 'itemsize':144})

P_008

P-008: Data that meets the following conditions by specifying columns in the order of receipt details data frame (df_receipt), sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount). Extract.

--Customer ID (customer_id) is "CS018205000001" --Product code (product_cd) is other than "P071401019"

`In[008]`


col_list = ['sales_ymd', 'customer_id', 'product_cd', 'quantity', 'amount']
cond = ((arr_receipt['customer_id'] == 'CS018205000001')
        & (arr_receipt['product_cd'] != 'P071401019'))

arr_receipt[col_list][cond]

`Out[008]`


array([(20180911, 'CS018205000001', 'P071401012', 1, 2200),
       (20180414, 'CS018205000001', 'P060104007', 6,  600),
       (20170614, 'CS018205000001', 'P050206001', 5,  990),
       (20170614, 'CS018205000001', 'P060702015', 1,  108),
       (20190216, 'CS018205000001', 'P071005024', 1,  102),
       (20180414, 'CS018205000001', 'P071101002', 1,  278),
       (20190226, 'CS018205000001', 'P070902035', 1,  168),
       (20190924, 'CS018205000001', 'P060805001', 1,  495),
       (20190226, 'CS018205000001', 'P071401020', 1, 2200),
       (20180911, 'CS018205000001', 'P071401005', 1, 1100),
       (20190216, 'CS018205000001', 'P040101002', 1,  218),
       (20190924, 'CS018205000001', 'P091503001', 1,  280)],
      dtype={'names':['sales_ymd','customer_id','product_cd','quantity','amount'], 'formats':['<i4','<U14','<U10','<i4','<i4'], 'offsets':[0,40,96,136,140], 'itemsize':144})

`Time[008]`


col_list = ['sales_ymd', 'customer_id', 'product_cd', 'quantity', 'amount']

%timeit df_receipt[col_list].query('customer_id == "CS018205000001" & product_cd != "P071401019"')
# 15.6 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit arr_receipt[col_list][((arr_receipt['customer_id'] == 'CS018205000001') & (arr_receipt['product_cd'] != 'P071401019'))]
# 4.28 ms ± 86.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

in conclusion

Hey This is high-speed, high-speed, data processing made with NumPy. You can spit it out to csv after processing, and pandas will not do heavy processing. That's why I quit pandas