I'll be solving the Python problems from Data Science 100 Knocks (Structured Data Processing). The model answers use pandas for the data processing, but as a study exercise we will process everything with NumPy instead.
:arrow_up: First article (#1) | :arrow_backward: Previous article (#3) | :arrow_forward: Next article (#5)
Many people doing data science in Python are probably pandas lovers, but in fact **you can do the same things with NumPy, without pandas** — and NumPy is usually faster. As a pandas lover myself, I'm still not used to operating NumPy, so in this series I try to graduate from pandas by working through "Data Science 100 Knocks" with NumPy.
This time I'll tackle questions 23 through 35, whose theme is group-by. The initial data was read as follows (data-type specification is postponed for now).
import numpy as np
import pandas as pd
from numpy.lib import recfunctions as rfn

# For the model answers
df_customer = pd.read_csv('data/customer.csv')
df_receipt = pd.read_csv('data/receipt.csv')

# Data we actually work with
arr_customer = np.genfromtxt(
    'data/customer.csv', delimiter=',', encoding='utf-8',
    names=True, dtype=None)
arr_receipt = np.genfromtxt(
    'data/receipt.csv', delimiter=',', encoding='utf-8',
    names=True, dtype=None)
With this theme, the inefficiency of a structured array that simply holds the CSV as-is becomes noticeable, so I stopped using it here :stuck_out_tongue_winking_eye:. To operate on NumPy arrays efficiently, two things matter: the data should be numeric, and the memory layout should be optimized. So we introduce the following function.
def array_to_dict(arr):
    dic = dict()
    for colname in arr.dtype.names:
        if np.issubdtype(arr[colname].dtype, np.number):
            # Numeric data: optimize the memory layout and store it in the dictionary
            dic[colname] = np.ascontiguousarray(arr[colname])
        else:
            # String data: convert it to integer codes and store those in the dictionary
            unq, inv = np.unique(arr[colname], return_inverse=True)
            dic[colname] = inv
            dic['Code_' + colname] = unq  # array used to convert codes back to strings
    return dic

dic_customer = array_to_dict(arr_customer)
dic_receipt = array_to_dict(arr_receipt)
dic_receipt
This function returns the table as a dictionary that holds one ndarray per column.
{'sales_ymd':
array([20181103, 20181118, 20170712, ..., 20170311, 20170331, 20190423]),
'sales_epoch':
array([1257206400, 1258502400, 1215820800, ..., 1205193600, 1206921600, 1271980800]),
'store_cd':
array([29, 10, 40, ..., 44, 6, 13], dtype=int64),
'Code_store_cd':
array(['S12007', 'S12013', 'S12014', ..., 'S14048', 'S14049', 'S14050'], dtype='<U6'),
'receipt_no':
array([ 112, 1132, 1102, ..., 1122, 1142, 1102]),
'receipt_sub_no':
array([1, 2, 1, ..., 1, 1, 2]),
'customer_id':
array([1689, 2112, 5898, ..., 8103, 582, 8306], dtype=int64),
'Code_customer_id':
array(['CS001113000004', 'CS001114000005', 'CS001115000010', ..., 'CS052212000002', 'CS052514000001', 'ZZ000000000000'], dtype='<U14'),
'product_cd':
array([2119, 3235, 861, ..., 457, 1030, 808], dtype=int64),
'Code_product_cd':
array(['P040101001', 'P040101002', 'P040101003', ..., 'P091503003', 'P091503004', 'P091503005'], dtype='<U10'),
'quantity':
array([1, 1, 1, ..., 1, 1, 1]),
'amount':
array([158, 81, 170, ..., 168, 148, 138])}
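The string-to-code conversion inside array_to_dict() is the factorization trick below — a minimal sketch on toy data (the values are made up):

```python
import numpy as np

# Toy string column (made-up values)
col = np.array(['S14', 'S12', 'S14', 'S13', 'S12'])
unq, inv = np.unique(col, return_inverse=True)
print(unq)       # sorted unique strings: ['S12' 'S13' 'S14']
print(inv)       # integer code of each row: [2 0 2 1 0]
print(unq[inv])  # round-trips back to the original strings
```

`unq[inv]` reconstructs the original column, which is exactly how the `Code_` arrays are used later to turn codes back into strings.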
Let's check with np.argsort() how much this preprocessing matters.
arr_receipt['sales_ymd'].flags['C_CONTIGUOUS']
# False
dic_receipt['sales_ymd'].flags['C_CONTIGUOUS']
# True
%timeit np.argsort(arr_receipt['sales_ymd'])
# 7.33 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.argsort(dic_receipt['sales_ymd'])
# 5.98 ms ± 58.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is good. Admittedly we've thrown away the structured array and its conveniences, making the data less handy, but this article is about abandoning pandas and studying NumPy, so no problem! :hugging:
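What np.ascontiguousarray() is doing can be seen on a toy structured array (the field names are made up):

```python
import numpy as np

# A record array with two 8-byte fields: a single-field view is strided
rec = np.zeros(4, dtype=[('a', 'i8'), ('b', 'i8')])
view = rec['a']                    # stride is 16 bytes -> not contiguous
copy = np.ascontiguousarray(view)  # compacted into its own buffer
print(view.flags['C_CONTIGUOUS'])  # False
print(copy.flags['C_CONTIGUOUS'])  # True
```

A field view skips over the other fields' bytes on every step, which is why genfromtxt's structured columns benchmark slower above.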
By the way, the messy code from last time that builds a structured array has also been turned into a function.
def make_array(size, **kwargs):
    arr = np.empty(size, dtype=[(colname, subarr.dtype)
                                for colname, subarr in kwargs.items()])
    for colname, subarr in kwargs.items():
        arr[colname] = subarr
    return arr
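For example, with made-up columns it works like this:

```python
import numpy as np

def make_array(size, **kwargs):
    # Build a structured array whose fields are the given keyword columns
    arr = np.empty(size, dtype=[(colname, subarr.dtype)
                                for colname, subarr in kwargs.items()])
    for colname, subarr in kwargs.items():
        arr[colname] = subarr
    return arr

out = make_array(3, name=np.array(['x', 'y', 'z']), val=np.array([1., 2., 3.]))
print(out.dtype.names)  # ('name', 'val')
print(out['val'][2])    # 3.0
```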
P_023
P-023: Sum the sales amount (amount) and sales quantity (quantity) for each store code (store_cd) for the receipt detail data frame (df_receipt).
Use np.bincount() to compute the sum for each value of the store-code column. This technique works only because the store-code strings have been converted to integer codes.
In[023]
make_array(
    dic_receipt['Code_store_cd'].size,
    store_cd=dic_receipt['Code_store_cd'],
    amount=np.bincount(dic_receipt['store_cd'], dic_receipt['amount']),
    quantity=np.bincount(dic_receipt['store_cd'], dic_receipt['quantity']))
Out[023]
array([('S12007', 638761., 2099.), ('S12013', 787513., 2425.),
('S12014', 725167., 2358.), ('S12029', 794741., 2555.),
('S12030', 684402., 2403.), ('S13001', 811936., 2347.),
('S13002', 727821., 2340.), ('S13003', 764294., 2197.),
('S13004', 779373., 2390.), ('S13005', 629876., 2004.),
('S13008', 809288., 2491.), ('S13009', 808870., 2486.),
('S13015', 780873., 2248.), ('S13016', 793773., 2432.),
('S13017', 748221., 2376.), ('S13018', 790535., 2562.),
('S13019', 827833., 2541.), ('S13020', 796383., 2383.),
('S13031', 705968., 2336.), ('S13032', 790501., 2491.),
('S13035', 715869., 2219.), ('S13037', 693087., 2344.),
('S13038', 708884., 2337.), ('S13039', 611888., 1981.),
('S13041', 728266., 2233.), ('S13043', 587895., 1881.),
('S13044', 520764., 1729.), ('S13051', 107452., 354.),
('S13052', 100314., 250.), ('S14006', 712839., 2284.),
('S14010', 790361., 2290.), ('S14011', 805724., 2434.),
('S14012', 720600., 2412.), ('S14021', 699511., 2231.),
('S14022', 651328., 2047.), ('S14023', 727630., 2258.),
('S14024', 736323., 2417.), ('S14025', 755581., 2394.),
('S14026', 824537., 2503.), ('S14027', 714550., 2303.),
('S14028', 786145., 2458.), ('S14033', 725318., 2282.),
('S14034', 653681., 2024.), ('S14036', 203694., 635.),
('S14040', 701858., 2233.), ('S14042', 534689., 1935.),
('S14045', 458484., 1398.), ('S14046', 412646., 1354.),
('S14047', 338329., 1041.), ('S14048', 234276., 769.),
('S14049', 230808., 788.), ('S14050', 167090., 580.)],
dtype=[('store_cd', '<U6'), ('amount', '<f8'), ('quantity', '<f8')])
Time[023]
# Model answer
%%timeit
df_receipt.groupby('store_cd').agg({'amount':'sum', 'quantity':'sum'}).reset_index()
# 9.14 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
make_array(
    dic_receipt['Code_store_cd'].size,
    store_cd=dic_receipt['Code_store_cd'],
    amount=np.bincount(dic_receipt['store_cd'], dic_receipt['amount']),
    quantity=np.bincount(dic_receipt['store_cd'], dic_receipt['quantity']))
# 473 µs ± 19.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
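The weighted-bincount trick used above, sketched on toy data (labels and amounts made up):

```python
import numpy as np

codes = np.array([0, 1, 0, 2, 1])          # integer group labels, 0..K-1
amount = np.array([10., 20., 30., 40., 50.])
sums = np.bincount(codes, weights=amount)  # per-group totals
counts = np.bincount(codes)                # per-group row counts
print(sums)    # [40. 70. 40.]
print(counts)  # [2 2 1]
```

Passing a `weights` array makes np.bincount accumulate the weights instead of counting, which is a group-by sum in a single C-level pass.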
P_024
P-024: For the receipt detail data frame (df_receipt), find the newest sales date (sales_ymd) for each customer ID (customer_id) and display 10 items.
Use np.maximum() to find the newest sales date. First, sort the sales dates by the customer-ID column. Next, get the positions of the rows where the customer ID changes, and use np.ufunc.reduceat() to apply the same operation (np.maximum()) over each customer's run of rows.
In[024]
sorter_index = np.argsort(dic_receipt['customer_id'])
sorted_id = dic_receipt['customer_id'][sorter_index]
sorted_ymd = dic_receipt['sales_ymd'][sorter_index]
cut_index, = np.concatenate(
    ([True], sorted_id[1:] != sorted_id[:-1])).nonzero()

make_array(
    cut_index.size,
    customer_id=dic_receipt['Code_customer_id'],
    sales_ymd=np.maximum.reduceat(sorted_ymd, cut_index))[:10]
Out[024]
array([('CS001113000004', 20190308), ('CS001114000005', 20190731),
('CS001115000010', 20190405), ('CS001205000004', 20190625),
('CS001205000006', 20190224), ('CS001211000025', 20190322),
('CS001212000027', 20170127), ('CS001212000031', 20180906),
('CS001212000046', 20170811), ('CS001212000070', 20191018)],
dtype=[('customer_id', '<U14'), ('sales_ymd', '<i4')])
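The sort → cut-index → reduceat pattern can be sketched on toy data (ids and values made up):

```python
import numpy as np

ids = np.array([2, 0, 1, 0, 2, 1])
vals = np.array([5, 1, 7, 3, 2, 4])
order = np.argsort(ids)
sid, sval = ids[order], vals[order]  # groups now form contiguous runs
# index where each run of equal ids starts
starts, = np.concatenate(([True], sid[1:] != sid[:-1])).nonzero()
group_max = np.maximum.reduceat(sval, starts)
print(group_max)  # [3 7 5] -> max for id 0, 1, 2
```

np.ufunc.reduceat applies the reduction between consecutive start indices, so one call produces the per-group maximum without a Python loop.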
P_025
P-025: For the receipt detail data frame (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 items.
In[025]
sorter_index = np.argsort(dic_receipt['customer_id'])
sorted_id = dic_receipt['customer_id'][sorter_index]
sorted_ymd = dic_receipt['sales_ymd'][sorter_index]
cut_index, = np.concatenate(
    ([True], sorted_id[1:] != sorted_id[:-1])).nonzero()

make_array(
    cut_index.size,
    customer_id=dic_receipt['Code_customer_id'],
    sales_ymd=np.minimum.reduceat(sorted_ymd, cut_index))[:10]
Out[025]
array([('CS001113000004', 20190308), ('CS001114000005', 20180503),
('CS001115000010', 20171228), ('CS001205000004', 20170914),
('CS001205000006', 20180207), ('CS001211000025', 20190322),
('CS001212000027', 20170127), ('CS001212000031', 20180906),
('CS001212000046', 20170811), ('CS001212000070', 20191018)],
dtype=[('customer_id', '<U14'), ('sales_ymd', '<i4')])
P_026
P-026: For the receipt detail data frame (df_receipt), find the newest and oldest sales dates (sales_ymd) for each customer ID (customer_id), and display 10 records where the two dates differ.
In[026]
sorter_index = np.argsort(dic_receipt['customer_id'])
sorted_id = dic_receipt['customer_id'][sorter_index]
sorted_ymd = dic_receipt['sales_ymd'][sorter_index]
cut_index, = np.concatenate(
    ([True], sorted_id[1:] != sorted_id[:-1])).nonzero()

sales_ymd_max = np.maximum.reduceat(sorted_ymd, cut_index)
sales_ymd_min = np.minimum.reduceat(sorted_ymd, cut_index)
new_arr = make_array(cut_index.size,
                     customer_id=dic_receipt['Code_customer_id'],
                     sales_ymd_max=sales_ymd_max, sales_ymd_min=sales_ymd_min)
new_arr[sales_ymd_max != sales_ymd_min][:10]
Out[026]
array([('CS001114000005', 20190731, 20180503),
('CS001115000010', 20190405, 20171228),
('CS001205000004', 20190625, 20170914),
('CS001205000006', 20190224, 20180207),
('CS001214000009', 20190902, 20170306),
('CS001214000017', 20191006, 20180828),
('CS001214000048', 20190929, 20171109),
('CS001214000052', 20190617, 20180208),
('CS001215000005', 20181021, 20170206),
('CS001215000040', 20171022, 20170214)],
dtype=[('customer_id', '<U14'), ('sales_ymd_max', '<i4'), ('sales_ymd_min', '<i4')])
P_027
P-027: Calculate the average sales amount (amount) for each store code (store_cd) for the receipt detail data frame (df_receipt), and display the TOP5 in descending order.
Use np.bincount() to get the per-store row count and amount total, then compute the mean as total ÷ count.
In[027]
mean_amount = (np.bincount(dic_receipt['store_cd'], dic_receipt['amount'])
               / np.bincount(dic_receipt['store_cd']))
new_arr = make_array(dic_receipt['Code_store_cd'].size,
                     store_cd=dic_receipt['Code_store_cd'],
                     amount=mean_amount)
new_arr[np.argsort(mean_amount)[::-1]][:5]
Out[027]
array([('S13052', 402.86746988), ('S13015', 351.11196043),
('S13003', 350.91551882), ('S14010', 348.79126214),
('S13001', 348.47038627)],
dtype=[('store_cd', '<U6'), ('amount', '<f8')])
P_028
P-028: Calculate the median sales amount (amount) for each store code (store_cd) for the receipt statement data frame (df_receipt), and display the TOP5 in descending order.
Proceed as when finding the maximum and minimum, but loop np.median() over each store's slice.
In[028]
sorter_index = np.argsort(dic_receipt['store_cd'])
sorted_cd = dic_receipt['store_cd'][sorter_index]
sorted_amount = dic_receipt['amount'][sorter_index]
cut_index, = np.concatenate(
    ([True], sorted_cd[1:] != sorted_cd[:-1], [True])).nonzero()

median_amount = np.array([np.median(sorted_amount[s:e])
                          for s, e in zip(cut_index[:-1], cut_index[1:])])
new_arr = make_array(dic_receipt['Code_store_cd'].size,
                     store_cd=dic_receipt['Code_store_cd'],
                     amount=median_amount)
new_arr[np.argsort(median_amount)[::-1]][:5]
As shown below, another approach is to build an array sorted by "store code → sales amount" with np.lexsort(), then count each store code to compute the median index directly. However, np.lexsort() turned out to be too slow.
# Sort the array
sorter_index = np.lexsort((dic_receipt['amount'], dic_receipt['store_cd']))
sorted_cd = dic_receipt['store_cd'][sorter_index]
sorted_amount = dic_receipt['amount'][sorter_index]

# Find the median index
counts = np.bincount(sorted_cd)
median_index = counts//2
median_index[1:] += counts.cumsum()[:-1]

# Compute the median
med_a = sorted_amount[median_index]
med_b = sorted_amount[median_index - 1]
median_amount = np.where(counts % 2, med_a, (med_a+med_b)/2)
Out[028]
array([('S13052', 190.), ('S14010', 188.), ('S14050', 185.),
('S14040', 180.), ('S13003', 180.)],
dtype=[('store_cd', '<U6'), ('amount', '<f8')])
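A sanity check of the median-index arithmetic on toy data (groups and amounts made up):

```python
import numpy as np

cd = np.array([0, 0, 0, 1, 1])    # group codes
amt = np.array([5., 1., 3., 2., 4.])
order = np.lexsort((amt, cd))     # sort by group, then by amount
s_amt = amt[order]                # [1. 3. 5. 2. 4.]
counts = np.bincount(cd)          # [3 2]
idx = counts // 2
idx[1:] += counts.cumsum()[:-1]   # middle position within each run
med_a = s_amt[idx]
med_b = s_amt[idx - 1]
medians = np.where(counts % 2, med_a, (med_a + med_b) / 2)
print(medians)  # [3. 3.] -> median of {1,3,5} and of {2,4}
```

For odd-sized groups the middle element is the median; for even-sized groups it averages the two middle elements, which is what the np.where branch selects.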
P_029
P-029: Find the mode of the product code (product_cd) for each store code (store_cd) for the receipt details data frame (df_receipt).
Create a 2-D table indexed by store code and product code, and use np.add.at() to add 1 at each (store, product) coordinate. Then use np.argmax() to get the product code with the maximum count for each store code.
In[029]
mapping = np.zeros((dic_receipt['Code_store_cd'].size,
                    dic_receipt['Code_product_cd'].size), dtype=int)
np.add.at(mapping, (dic_receipt['store_cd'], dic_receipt['product_cd']), 1)

make_array(
    dic_receipt['Code_store_cd'].size,
    store_cd=dic_receipt['Code_store_cd'],
    product_cd=dic_receipt['Code_product_cd'][np.argmax(mapping, axis=1)])
Out[029]
array([('S12007', 'P060303001'), ('S12013', 'P060303001'),
('S12014', 'P060303001'), ('S12029', 'P060303001'),
('S12030', 'P060303001'), ('S13001', 'P060303001'),
('S13002', 'P060303001'), ('S13003', 'P071401001'),
('S13004', 'P060303001'), ('S13005', 'P040503001'),
('S13008', 'P060303001'), ('S13009', 'P060303001'),
('S13015', 'P071401001'), ('S13016', 'P071102001'),
('S13017', 'P060101002'), ('S13018', 'P071401001'),
('S13019', 'P071401001'), ('S13020', 'P071401001'),
('S13031', 'P060303001'), ('S13032', 'P060303001'),
('S13035', 'P040503001'), ('S13037', 'P060303001'),
('S13038', 'P060303001'), ('S13039', 'P071401001'),
('S13041', 'P071401001'), ('S13043', 'P060303001'),
('S13044', 'P060303001'), ('S13051', 'P050102001'),
('S13052', 'P050101001'), ('S14006', 'P060303001'),
('S14010', 'P060303001'), ('S14011', 'P060101001'),
('S14012', 'P060303001'), ('S14021', 'P060101001'),
('S14022', 'P060303001'), ('S14023', 'P071401001'),
('S14024', 'P060303001'), ('S14025', 'P060303001'),
('S14026', 'P071401001'), ('S14027', 'P060303001'),
('S14028', 'P060303001'), ('S14033', 'P071401001'),
('S14034', 'P060303001'), ('S14036', 'P040503001'),
('S14040', 'P060303001'), ('S14042', 'P050101001'),
('S14045', 'P060303001'), ('S14046', 'P060303001'),
('S14047', 'P060303001'), ('S14048', 'P050101001'),
('S14049', 'P060303001'), ('S14050', 'P060303001')],
dtype=[('store_cd', '<U6'), ('product_cd', '<U10')])
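The 2-D counting trick, sketched on toy data (store and product codes made up):

```python
import numpy as np

store = np.array([0, 0, 1, 1, 1])
prod = np.array([2, 2, 0, 1, 1])
table = np.zeros((2, 3), dtype=int)  # stores x products
np.add.at(table, (store, prod), 1)   # unbuffered scatter-add of 1s
mode = np.argmax(table, axis=1)      # most frequent product per store
print(table)  # [[0 0 2]
              #  [1 2 0]]
print(mode)   # [2 1]
```

np.add.at is needed because plain fancy-index assignment (`table[store, prod] += 1`) would lose repeated coordinates; the unbuffered form accumulates every occurrence.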
P_030
P-030: For the receipt detail data frame (df_receipt), calculate the sample variance of the sales amount (amount) for each store code (store_cd), and display the TOP5 in descending order.
First, use np.bincount() to get the per-store row count and amount total, and compute the mean as total ÷ count. Then compute each row's deviation from its store's mean, and use np.bincount() again on the squared deviations, dividing by the counts, to get the per-store variance.
In[030]
counts = np.bincount(dic_receipt['store_cd'])
mean_amount = (np.bincount(dic_receipt['store_cd'], dic_receipt['amount'])
               / counts)
deviation_array = mean_amount[dic_receipt['store_cd']] - dic_receipt['amount']
var_amount = np.bincount(dic_receipt['store_cd'], deviation_array**2) / counts

new_arr = make_array(dic_receipt['Code_store_cd'].size,
                     store_cd=dic_receipt['Code_store_cd'],
                     amount=var_amount)
new_arr[np.argsort(var_amount)[::-1]][:5]
Out[030]
array([('S13052', 440088.70131127), ('S14011', 306314.55816389),
('S14034', 296920.08101128), ('S13001', 295431.99332904),
('S13015', 295294.36111594)],
dtype=[('store_cd', '<U6'), ('amount', '<f8')])
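A quick check of the bincount variance against np.var() on toy data (labels and values made up):

```python
import numpy as np

codes = np.array([0, 0, 1, 1, 1])
x = np.array([1., 3., 2., 4., 6.])
counts = np.bincount(codes)
mean = np.bincount(codes, weights=x) / counts
dev = x - mean[codes]                                # mean[codes] broadcasts the
var = np.bincount(codes, weights=dev**2) / counts    # group mean back to each row
print(var)  # [1. 2.66666667]
assert np.allclose(var, [np.var(x[codes == g]) for g in (0, 1)])
```

The `mean[codes]` gather is the key step: it broadcasts each group's mean back onto its rows so the deviations can be squared and re-aggregated with another bincount.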
P_031
P-031: Calculate the sample standard deviation of the sales amount (amount) for each store code (store_cd) for the receipt detail data frame (df_receipt), and display the TOP5 in descending order.
Since the standard deviation is the square root of the variance, just apply np.sqrt() to the previous answer.
In[031]
counts = np.bincount(dic_receipt['store_cd'])
mean_amount = (np.bincount(dic_receipt['store_cd'], dic_receipt['amount'])
               / counts)
deviation_array = mean_amount[dic_receipt['store_cd']] - dic_receipt['amount']
var_amount = np.bincount(dic_receipt['store_cd'], deviation_array**2) / counts

new_arr = make_array(dic_receipt['Code_store_cd'].size,
                     store_cd=dic_receipt['Code_store_cd'],
                     amount=np.sqrt(var_amount))
new_arr[np.argsort(var_amount)[::-1]][:5]
Out[031]
array([('S13052', 663.39181583), ('S14011', 553.45691627),
('S14034', 544.90373555), ('S13001', 543.53656117),
('S13015', 543.40993837)],
dtype=[('store_cd', '<U6'), ('amount', '<f8')])
P_032
P-032: Find the percentile value of the sales amount (amount) of the receipt detail data frame (df_receipt) in 25% increments.
In[032]
np.percentile(dic_receipt['amount'], np.arange(5) * 25)
Out[032]
array([ 10., 102., 170., 288., 10925.])
P_033
P-033: Calculate the average sales amount (amount) for each store code (store_cd) for the receipt detail data frame (df_receipt), and extract 330 or more.
Same as question 27.
In[033]
mean_amount = (np.bincount(dic_receipt['store_cd'], dic_receipt['amount'])
               / np.bincount(dic_receipt['store_cd']))
new_arr = make_array(dic_receipt['Code_store_cd'].size,
                     store_cd=dic_receipt['Code_store_cd'],
                     amount=mean_amount)
new_arr[mean_amount >= 330]
Out[033]
array([('S12013', 330.19412998), ('S13001', 348.47038627),
('S13003', 350.91551882), ('S13004', 330.94394904),
('S13015', 351.11196043), ('S13019', 330.20861588),
('S13020', 337.87993212), ('S13052', 402.86746988),
('S14010', 348.79126214), ('S14011', 335.71833333),
('S14026', 332.34058847), ('S14045', 330.08207343),
('S14047', 330.07707317)],
dtype=[('store_cd', '<U6'), ('amount', '<f8')])
P_034
P-034: For the receipt detail data frame (df_receipt), add up the sales amount (amount) for each customer ID (customer_id) and calculate the average of all customers. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation.
First, detect non-members using the "string ⇔ code" conversion array: cast Code_customer_id with .astype('<U1') to keep only the first character, compare it with 'Z', and extract the matching integer codes with .nonzero(). Then use np.in1d() to build a boolean array `is_member` that marks which rows of the customer-ID column belong to members. Finally, sum by customer ID and take the mean.
In[034]
startswithZ, = (dic_receipt['Code_customer_id'].astype('<U1') == 'Z').nonzero()
is_member = np.in1d(dic_receipt['customer_id'], startswithZ, invert=True)

sums = np.bincount(dic_receipt['customer_id'][is_member],
                   dic_receipt['amount'][is_member])
np.mean(sums)
Out[034]
2547.742234529256
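The non-member exclusion trick, sketched on toy data (IDs are made up; since the code array comes from np.unique it is sorted, so 'Z…' IDs get the largest codes). Here np.isin() is the modern spelling of the article's np.in1d():

```python
import numpy as np

code_ids = np.array(['CS001', 'CS002', 'CS003', 'ZZ000'])  # sorted unique IDs
rows = np.array([0, 3, 1, 0, 2, 3])                        # per-row integer codes
amount = np.array([10., 99., 20., 30., 40., 99.])
z_codes, = (code_ids.astype('<U1') == 'Z').nonzero()       # codes of 'Z...' IDs
is_member = np.isin(rows, z_codes, invert=True)
sums = np.bincount(rows[is_member], weights=amount[is_member])
print(sums)         # [40. 20. 40.] -> totals for CS001, CS002, CS003
print(sums.mean())  # 33.33...
```

Casting to '<U1' truncates each string to its first character, so the comparison with 'Z' flags exactly the non-member codes.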
P_035
P-035: For the receipt detail data frame (df_receipt), sum the sales amount (amount) for each customer ID (customer_id), find the average over all customers, and extract the customers who spent more than the average. However, customer IDs starting with "Z" represent non-members, so exclude them from the calculation. Displaying 10 items is sufficient.
Just add the extraction step to the previous question.
In[035]
startswithZ, = (dic_receipt['Code_customer_id'].astype('<U1') == 'Z').nonzero()
is_member = np.in1d(dic_receipt['customer_id'], startswithZ, invert=True)
sums = np.bincount(dic_receipt['customer_id'][is_member],
                   dic_receipt['amount'][is_member])
mean = np.mean(sums)

new_arr = make_array(dic_receipt['Code_customer_id'].size - startswithZ.size,
                     store_cd=np.delete(dic_receipt['Code_customer_id'],
                                        startswithZ),
                     amount=sums)
new_arr[sums > mean][:10]
Out[035]
array([('CS001113000004', 3044.), ('CS001113000004', 3337.),
('CS001113000004', 4685.), ('CS001113000004', 4132.),
('CS001113000004', 5639.), ('CS001113000004', 3496.),
('CS001113000004', 3726.), ('CS001113000004', 3485.),
('CS001113000004', 4370.), ('CS001113000004', 3300.)],
dtype=[('store_cd', '<U14'), ('amount', '<f8')])
Incidentally, in the model answer, df_receipt.groupby('customer_id').amount.sum() is for some reason computed twice.
NumPy has no built-in group-by, so this set was a tough fight.