In this series I solve the Python exercises of Data Science 100 Knocks (Structured Data Processing). The official model answers process the data with pandas, but as a study exercise I will process it with NumPy instead.
Many people doing data science in Python may be pandas lovers, but in fact **you can do the same things with NumPy alone, without pandas**, and NumPy is usually faster. As a pandas lover myself, I am still not used to operating on NumPy, so I would like to try to graduate from pandas by working through this "Data Science 100 Knocks" with NumPy.
This time I will tackle questions 17 through 22, whose theme is sorting and ranking.
The initial data was read as follows. As a policy, I do not vectorize functions with `np.vectorize()` or `np.frompyfunc()`.
import numpy as np
import pandas as pd
from numpy.lib import recfunctions as rfn
# For the model answers
df_customer = pd.read_csv('data/customer.csv')
df_receipt = pd.read_csv('data/receipt.csv')
# The data we handle
arr_customer = np.genfromtxt(
'data/customer.csv', delimiter=',', encoding='utf-8',
names=True, dtype=None)
arr_receipt = np.genfromtxt(
'data/receipt.csv', delimiter=',', encoding='utf-8',
names=True, dtype=None)
P_017
P-17: Sort the customer data frame (df_customer) by date of birth (birth_day), oldest first, and display all columns of the first 10 rows.
To sort an array, use `np.sort()`, which returns a sorted copy. This time we want the sort order itself, so use `np.argsort()` or `np.ndarray.argsort()` instead. There is also a way to pass the column name to the `order` argument of `np.sort()`, but it is not recommended because it behaves poorly.
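On a toy structured array (the field names and values here are made up for illustration), both approaches give the same result; the `argsort()` route just makes the permutation explicit:

```python
import numpy as np

# A tiny made-up structured array to compare the two approaches.
toy = np.array([('b', 3), ('a', 1), ('c', 2)],
               dtype=[('key', '<U1'), ('val', '<i4')])

# np.sort() with the order argument returns a new array sorted by that field.
by_order = np.sort(toy, order='val')

# argsort() on one field gives the sort permutation,
# which we then apply to the whole array.
by_argsort = toy[toy['val'].argsort()]

print(by_order['key'])    # ['a' 'c' 'b']
print(by_argsort['key'])  # ['a' 'c' 'b']
```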
In[017]
arr_customer[arr_customer['birth_day'].argsort()][:10]
Out[017]
array([('CS003813000014', 'Nanami Murayama', 1, 'Female', '1928-11-26', 90, '182-0007', 'Kikunodai, Chofu-shi, Tokyo**********', 'S13003', 20160214, '0-00000000-0'),
('CS026813000004', 'Yoshimura Chaoyang', 1, 'Female', '1928-12-14', 90, '251-0043', 'Tsujido Motomachi, Fujisawa City, Kanagawa Prefecture**********', 'S14026', 20150723, '0-00000000-0'),
('CS018811000003', 'Misato Kumazawa', 1, 'Female', '1929-01-07', 90, '204-0004', 'Noshio, Kiyose City, Tokyo**********', 'S13018', 20150403, '0-00000000-0'),
('CS027803000004', 'Takuro Uchimura', 0, 'male', '1929-01-12', 90, '251-0031', 'Kugenumafujigaya, Fujisawa City, Kanagawa Prefecture**********', 'S14027', 20151227, '0-00000000-0'),
('CS013801000003', 'Takuro Amano', 0, 'male', '1929-01-15', 90, '274-0824', 'Maebarahigashi, Funabashi City, Chiba Prefecture**********', 'S12013', 20160120, '0-00000000-0'),
('CS001814000022', 'Riho Tsuruta', 1, 'Female', '1929-01-28', 90, '144-0045', 'Minamirokugo, Ota-ku, Tokyo**********', 'S13001', 20161012, 'A-20090415-7'),
('CS016815000002', 'Miki Yamamoto', 1, 'Female', '1929-02-22', 90, '184-0005', 'Sakuracho, Koganei City, Tokyo**********', 'S13016', 20150629, 'C-20090923-C'),
('CS009815000003', 'Riho Nakata', 1, 'Female', '1929-04-08', 89, '154-0014', 'Shinmachi, Setagaya-ku, Tokyo**********', 'S13009', 20150421, 'D-20091021-E'),
('CS005813000015', 'Erika Kanaya', 1, 'Female', '1929-04-09', 89, '165-0032', 'Saginomiya, Nakano-ku, Tokyo**********', 'S13005', 20150506, '0-00000000-0'),
('CS012813000013', 'Uno Minami Tomo', 1, 'Female', '1929-04-09', 89, '231-0806', 'Honmoku-cho, Naka-ku, Yokohama-shi, Kanagawa**********', 'S14012', 20150712, '0-00000000-0')],
dtype=[('customer_id', '<U14'), ('customer_name', '<U10'), ('gender_cd', '<i4'), ('gender', '<U2'), ('birth_day', '<U10'), ('age', '<i4'), ('postal_cd', '<U8'), ('address', '<U26'), ('application_store_cd', '<U6'), ('application_date', '<i4'), ('status_cd', '<U12')])
Time[017]
%timeit df_customer.sort_values('birth_day', ascending=True).head(10)
# 21.8 ms ± 346 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr_customer[arr_customer['birth_day'].argsort()][:10]
# 17.1 ms ± 473 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
P_018
P-18: Sort the customer data frame (df_customer) by date of birth (birth_day), youngest first, and display all columns of the first 10 rows.
Just reverse the order.
In[018]
arr_customer[arr_customer['birth_day'].argsort()][::-1][:10]
Out[018]
array([('CS035114000004', 'Misato Omura', 1, 'Female', '2007-11-25', 11, '156-0053', 'Sakura, Setagaya-ku, Tokyo**********', 'S13035', 20150619, '6-20091205-6'),
('CS022103000002', 'Fukuyama Hajime', 9, 'unknown', '2007-10-02', 11, '249-0006', 'Zushi, Zushi City, Kanagawa Prefecture**********', 'S14022', 20160909, '0-00000000-0'),
('CS002113000009', 'Mayuko Shibata', 1, 'Female', '2007-09-17', 11, '184-0014', 'Nukuiminami-cho, Koganei-shi, Tokyo**********', 'S13002', 20160304, '0-00000000-0'),
('CS004115000014', 'Kyoko Matsui', 1, 'Female', '2007-08-09', 11, '165-0031', 'Kamisaginomiya, Nakano-ku, Tokyo**********', 'S13004', 20161120, '1-20081231-1'),
('CS002114000010', 'Haruka Yamauchi', 1, 'Female', '2007-06-03', 11, '184-0015', 'Nukui Kitamachi, Koganei City, Tokyo**********', 'S13002', 20160920, '6-20100510-1'),
('CS025115000002', 'Natsuki Koyanagi', 1, 'Female', '2007-04-18', 11, '245-0018', 'Kamiiida-cho, Izumi-ku, Yokohama-shi, Kanagawa**********', 'S14025', 20160116, 'D-20100913-D'),
('CS002113000025', 'Hirosue Manami', 1, 'Female', '2007-03-30', 12, '184-0015', 'Nukui Kitamachi, Koganei City, Tokyo**********', 'S13002', 20171030, '0-00000000-0'),
('CS033112000003', 'Miki Nagano', 1, 'Female', '2007-03-22', 12, '245-0051', 'Nase-cho, Totsuka-ku, Yokohama-shi, Kanagawa**********', 'S14033', 20150606, '0-00000000-0'),
('CS007115000006', 'Fukuoka Shun', 1, 'Female', '2007-03-10', 12, '285-0845', 'Nishishizu, Sakura City, Chiba Prefecture**********', 'S12007', 20151118, 'F-20101016-F'),
('CS014113000008', 'Rio Yaguchi', 1, 'Female', '2007-03-05', 12, '260-0041', 'Higashichiba, Chuo-ku, Chiba City, Chiba Prefecture**********', 'S12014', 20150622, '3-20091108-6')],
dtype=[('customer_id', '<U14'), ('customer_name', '<U10'), ('gender_cd', '<i4'), ('gender', '<U2'), ('birth_day', '<U10'), ('age', '<i4'), ('postal_cd', '<U8'), ('address', '<U26'), ('application_store_cd', '<U6'), ('application_date', '<i4'), ('status_cd', '<U12')])
P_019
P-19: Rank the rows of the receipt detail data frame (df_receipt) in descending order of sales amount (amount), and extract the first 10 rows. Display the customer ID (customer_id), sales amount (amount), and the assigned rank. If sales amounts are equal, assign the same rank.
Sorting by sales amount, highest first, is easy.
sorted_array = arr_receipt[['customer_id', 'amount']][
arr_receipt['amount'].argsort()[::-1]]
sorted_array[:10]
# array([('CS011415000006', 10925), ('ZZ000000000000', 6800),
# ('CS028605000002', 5780), ('ZZ000000000000', 5480),
# ('ZZ000000000000', 5480), ('CS015515000034', 5480),
# ('CS021515000089', 5440), ('ZZ000000000000', 5440),
# ('ZZ000000000000', 5280), ('ZZ000000000000', 5280)],
# dtype={'names':['customer_id','amount'], 'formats':['<U14','<i4'], 'offsets':[40,140], 'itemsize':144})
Now rank these sales amounts. The easiest way is `scipy.stats.rankdata()`. We want descending ranks, but this function ranks in ascending order, so flip the result by subtracting it from its maximum and adding 1.
import scipy.stats
rank_asc = scipy.stats.rankdata(sorted_array['amount'], 'max')
rank = rank_asc.max() - rank_asc + 1
rank[:10]
# array([1, 2, 3, 4, 4, 4, 7, 7, 9, 9], dtype=int64)
Without `scipy.stats.rankdata()`, you can build the rank array yourself by detecting ties: mark the rows where a new (distinct) value starts, then map each row back to the index where its value first appears.
# False where a row has the same amount as the row above it, True otherwise
rank_cutidx = np.concatenate(
([True], sorted_array['amount'][1:] != sorted_array['amount'][:-1]))
# rank_cutidx.cumsum()-1: maps each row to its group number (0 for the largest value, 1 for the next, ...)
# rank_cutidx.nonzero()[0]: the index where each distinct value first appears
rank = rank_cutidx.nonzero()[0][rank_cutidx.cumsum()-1] + 1
rank[:10]
# array([1, 2, 3, 4, 4, 4, 7, 7, 9, 9], dtype=int64)
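Traced on a tiny made-up array, the tie-handling trick works like this:

```python
import numpy as np

# Made-up amounts, already sorted in descending order.
amounts = np.array([30, 20, 20, 10])

# True where a row starts a new value, False where it ties the row above.
cut = np.concatenate(([True], amounts[1:] != amounts[:-1]))
# cut -> [True, True, False, True]

# cut.cumsum()-1 maps each row to its group number: [0, 1, 1, 2]
# cut.nonzero()[0] is the first index of each group:  [0, 1, 3]
# Indexing one with the other gives each row's "min" rank index; +1 makes it 1-based.
ranks = cut.nonzero()[0][cut.cumsum() - 1] + 1
print(ranks)  # [1 2 2 4]
```

The two tied 20s both get rank 2, and the next value skips to rank 4, matching the `method='min'` ranking pandas produces.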
Columns can be added with `numpy.lib.recfunctions.append_fields()`, but this function is extremely slow (it appears to rebuild the array from scratch).
In[019]
sorted_array = arr_receipt[['customer_id', 'amount']][
arr_receipt['amount'].argsort()[::-1]]
rank_cutidx = np.concatenate(
([True], sorted_array['amount'][1:] != sorted_array['amount'][:-1]))
rank = rank_cutidx.nonzero()[0][rank_cutidx.cumsum()-1]+1
rfn.append_fields(sorted_array, 'ranking', rank,
dtypes=rank.dtype, usemask=False)[:10]
So instead, let's prepare an empty structured array ourselves and fill it in.
In[019]
sorter_index = arr_receipt['amount'].argsort()[::-1]
sorted_id = arr_receipt['customer_id'][sorter_index]
sorted_amount = arr_receipt['amount'][sorter_index]
rank_cutidx = np.concatenate(
([True], sorted_amount[1:] != sorted_amount[:-1]))
rank = rank_cutidx.nonzero()[0][rank_cutidx.cumsum()-1]+1
# Create the output array
new_arr = np.empty(arr_receipt.size, dtype=[('customer_id', sorted_id.dtype),
('amount', sorted_amount.dtype),
('ranking', rank.dtype)])
new_arr['customer_id'] = sorted_id
new_arr['amount'] = sorted_amount
new_arr['ranking'] = rank
new_arr[:10]
Out[019]
array([('CS011415000006', 10925, 1), ('ZZ000000000000', 6800, 2),
('CS028605000002', 5780, 3), ('ZZ000000000000', 5480, 4),
('ZZ000000000000', 5480, 4), ('CS015515000034', 5480, 4),
('CS021515000089', 5440, 7), ('ZZ000000000000', 5440, 7),
('ZZ000000000000', 5280, 9), ('ZZ000000000000', 5280, 9)],
dtype=[('customer_id', '<U14'), ('amount', '<i4'), ('ranking', '<i8')])
Time[019]
# Model answer
%%timeit
df_tmp = pd.concat([df_receipt[['customer_id', 'amount']],
df_receipt['amount'].rank(method='min', ascending=False)], axis=1)
df_tmp.columns = ['customer_id', 'amount', 'ranking']
df_tmp.sort_values('ranking', ascending=True).head(10)
# 35 ms ± 643 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# A slight one-line improvement
%%timeit
df_receipt[['customer_id', 'amount']] \
.assign(ranking=df_receipt['amount'].rank(method='min', ascending=False)) \
.sort_values('ranking', ascending=True).head(10)
# 34.1 ms ± 943 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# NumPy
%%timeit
tmp_amount = np.ascontiguousarray(arr_receipt['amount'])
sorter_index = tmp_amount.argsort()[::-1]
sorted_id = arr_receipt['customer_id'][sorter_index]
sorted_amount = tmp_amount[sorter_index]
rank_cutidx = np.concatenate(([True], sorted_amount[1:] != sorted_amount[:-1]))
rank = rank_cutidx.nonzero()[0][rank_cutidx.cumsum()-1]+1
new_arr = np.empty(arr_receipt.size, dtype=[('customer_id', sorted_id.dtype), ('amount', sorted_amount.dtype), ('ranking', rank.dtype)])
new_arr['customer_id'] = sorted_id
new_arr['amount'] = sorted_amount
new_arr['ranking'] = rank
new_arr[:10]
# 22.6 ms ± 464 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I managed to process it faster than pandas.
P_020
P-020: Rank the rows of the receipt detail data frame (df_receipt) in descending order of sales amount (amount), and extract the first 10 rows. Display the customer ID (customer_id), sales amount (amount), and the assigned rank. This time, even if sales amounts are equal, assign a different rank to each row.
With unique ranks, the "rank" is just a sequential number, so `np.arange()` does the job.
In[020]
rfn.append_fields(arr_receipt[['customer_id', 'amount']]
[arr_receipt['amount'].argsort()[::-1]],
'ranking', np.arange(1, arr_receipt.size+1),
dtypes='<i4', usemask=False)[:10]
In[020]
sorter_index = arr_receipt['amount'].argsort()[::-1]
sorted_id = arr_receipt['customer_id'][sorter_index]
sorted_amount = arr_receipt['amount'][sorter_index]
rank = np.arange(1, sorted_id.size+1)
# Create the output array
new_arr = np.empty(arr_receipt.size, dtype=[('customer_id', sorted_id.dtype),
('amount', sorted_amount.dtype),
('ranking', rank.dtype)])
new_arr['customer_id'] = sorted_id
new_arr['amount'] = sorted_amount
new_arr['ranking'] = rank
new_arr[:10]
Out[020]
array([('CS011415000006', 10925, 1), ('ZZ000000000000', 6800, 2),
('CS028605000002', 5780, 3), ('ZZ000000000000', 5480, 4),
('ZZ000000000000', 5480, 5), ('CS015515000034', 5480, 6),
('CS021515000089', 5440, 7), ('ZZ000000000000', 5440, 8),
('ZZ000000000000', 5280, 9), ('ZZ000000000000', 5280, 10)],
dtype=[('customer_id', '<U14'), ('amount', '<i4'), ('ranking', '<i4')])
P_021
P-021: Count the number of rows in the receipt detail data frame (df_receipt).
In[021]
arr_receipt.size
len(arr_receipt)
arr_receipt.shape[0]
Out[021]
104681
104681
104681
P_022
P-022: Count the number of unique customer IDs (customer_id) in the receipt detail data frame (df_receipt).
Counting unique strings is not NumPy's strong point.
In[022]
np.unique(arr_receipt['customer_id']).size
Out[022]
8307
This is a case where pandas is known to be much faster than NumPy, as the official pandas documentation itself boasts; as NumPy users there is probably not much we can do about it.
Time[022]
%timeit len(df_receipt['customer_id'].unique())
# 8.19 ms ± 204 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.unique(arr_receipt['customer_id']).size
# 30.9 ms ± 635 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
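One reason for the gap is that `np.unique()` sorts the array, while `pd.unique()` uses a hash table and skips sorting. If mixing the libraries is acceptable, `pd.unique()` can be applied directly to the NumPy array (a sketch on made-up IDs standing in for customer_id):

```python
import numpy as np
import pandas as pd

# Made-up ID strings standing in for customer_id.
ids = np.array(['a', 'b', 'a', 'c', 'b', 'a'])

# np.unique() sorts the strings, which dominates the cost on large arrays.
n_np = np.unique(ids).size

# pd.unique() hashes instead of sorting, so it scales better for strings.
n_pd = len(pd.unique(ids))

print(n_np, n_pd)  # 3 3
```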