Continuing from last time, I will explain the Data Science 100 Knocks. Previous installments: Data science 100 knock commentary (P001~020), Data science 100 knock commentary (P021~040), Data science 100 knock commentary (P041~060), Data science 100 knock commentary (P061~080)
For setup, please refer to this article (* covers running with Docker on Mac)
P-081: Create a new df_product_2 that imputes the missing values of unit price (unit_price) and unit cost (unit_cost) with their respective mean values. Fractions of less than 1 yen may be rounded, with 0.5 rounded to even. After imputing, also confirm that no missing values remain in either column.
P-081
# Fill in the missing values with fillna
# Round half to even with np.round; np.nanmean computes the mean excluding missing values
df_product_2 = df_product.fillna({'unit_price': np.round(np.nanmean(df_product['unit_price'])),
                                  'unit_cost': np.round(np.nanmean(df_product['unit_cost']))})
# Check for missing values with isnull().sum()
df_product_2.isnull().sum()
Reference: Round numbers with pandas (rounding, rounding to even numbers) Reference: Calculate the sum and average of the array ndarray containing the missing value np.nan with NumPy
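As a quick check of the rounding behavior the problem statement allows: np.round implements round-half-to-even (banker's rounding), so 0.5 does not always round up. A minimal sketch:
import numpy as np

print(np.round(0.5))  # 0.0 -- 0.5 rounds to the nearest even integer
print(np.round(1.5))  # 2.0
print(np.round(2.5))  # 2.0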
P-082: Create a new df_product_3 that imputes the missing values of unit price (unit_price) and unit cost (unit_cost) with their respective medians. Fractions of less than 1 yen may be rounded, with 0.5 rounded to even. After imputing, also confirm that no missing values remain in either column.
P-082
# Fill in the missing values with fillna
# Round half to even with np.round; impute with the median via np.nanmedian
df_product_3 = df_product.fillna({'unit_price': np.round(np.nanmedian(df_product['unit_price'])),
                                  'unit_cost': np.round(np.nanmedian(df_product['unit_cost']))})
# Check for missing values with isnull().sum()
df_product_3.isnull().sum()
Reference: Round numbers with pandas (rounding, rounding to even numbers) Reference: [NumPy] 11. Various processing related to NaN in NumPy array
P-083: Create a new df_product_4 that imputes the missing values of unit price (unit_price) and unit cost (unit_cost) with the median calculated for each product subcategory (category_small_cd). Fractions of less than 1 yen may be rounded, with 0.5 rounded to even. After imputing, also confirm that no missing values remain in either column.
P-083
# (Solution 1)
# Calculate the median for each product subcategory (category_small_cd)
df_tmp = df_product.groupby('category_small_cd').agg({'unit_price': 'median',
                                                      'unit_cost': 'median'}).reset_index()
# Rename the columns
df_tmp.columns = ['category_small_cd', 'median_price', 'median_cost']
# Join the product data frame (df_product) with the medians (df_tmp)
df_product_4 = pd.merge(df_product, df_tmp, how='inner', on='category_small_cd')
# Impute the rounded median into missing unit_price values (axis=1 applies the function row by row)
df_product_4['unit_price'] = df_product_4[['unit_price', 'median_price']]. \
    apply(lambda x: np.round(x[1]) if np.isnan(x[0]) else x[0], axis=1)
# Impute the rounded median into missing unit_cost values (axis=1 applies the function row by row)
df_product_4['unit_cost'] = df_product_4[['unit_cost', 'median_cost']]. \
    apply(lambda x: np.round(x[1]) if np.isnan(x[0]) else x[0], axis=1)
# Check for missing values with isnull().sum()
df_product_4.isnull().sum()
P-083
# (Solution 2)
# Calculate the median for each product subcategory (category_small_cd)
df_tmp = df_product.groupby('category_small_cd').agg(median_price=('unit_price', 'median'),
                                                     median_cost=('unit_cost', 'median')).reset_index()
# Join the product data frame (df_product) with the medians (df_tmp)
df_product_4 = pd.merge(df_product, df_tmp, how='inner', on='category_small_cd')
# Use mask to replace the missing values (NaN) with the rounded median
# mask(condition to hide, value to fill); round with round()
df_product_4['unit_price'] = (df_product_4['unit_price'].mask(df_product_4['unit_price'].isnull(),
                                                              df_product_4['median_price'].round()))
df_product_4['unit_cost'] = (df_product_4['unit_cost'].mask(df_product_4['unit_cost'].isnull(),
                                                            df_product_4['median_cost'].round()))
# Check for missing values with isnull().sum()
df_product_4.isnull().sum()
P-083
# (Solution 3)
# Copy the product data frame (df_product)
df_product_4 = df_product.copy()
# Process unit_price and unit_cost in turn
# Use fillna to fill in the missing values
# Calculate the median for each subcategory (category_small_cd) with transform
for x in ['unit_price', 'unit_cost']:
    df_product_4[x] = df_product_4[x].fillna(df_product_4.groupby('category_small_cd')[x]
                                             .transform('median').round())
# Check for missing values with isnull().sum()
df_product_4.isnull().sum()
Reference: [NumPy] 11. Various processing related to NaN in NumPy array Reference: Python pandas data selection process in a little more detail <Part 2> Reference: Basic differences between Pandas transform and apply
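The key point of Solution 3 is that groupby().transform returns a Series aligned with the original index, so it can be passed straight to fillna. A minimal sketch with made-up data (the column names here are hypothetical):
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b', 'b'],
                   'val': [1.0, None, 10.0, 30.0]})
# transform('median') broadcasts each group's median back onto every row
medians = df.groupby('grp')['val'].transform('median')
print(medians.tolist())                    # [1.0, 1.0, 20.0, 20.0]
print(df['val'].fillna(medians).tolist())  # [1.0, 1.0, 10.0, 30.0]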
P-084: For every customer in the customer data frame (df_customer), calculate the ratio of 2019 sales to total sales over the entire period. Treat customers with no sales record as 0. Then extract records where the calculated ratio exceeds 0; displaying 10 results is enough. Also confirm that the created data contains no NA/NaN.
P-084
# Extract the 2019 rows from the receipt detail data frame (df_receipt) by sales_ymd (df_tmp_1)
df_tmp_1 = df_receipt.query('20190101 <= sales_ymd <= 20191231')
# Left join customer_id from the customer data frame (df_customer) with the 2019-only receipt data (df_tmp_1)
# Group by customer_id (rows are multiplied by the join) and sum the sales amount (amount)
# Rename the column (rename)
df_tmp_1 = pd.merge(df_customer['customer_id'], df_tmp_1[['customer_id', 'amount']], how='left', on='customer_id'). \
    groupby('customer_id').sum().reset_index().rename(columns={'amount': 'amount_2019'})
# Sum the sales amount over the entire period per customer
df_tmp_2 = pd.merge(df_customer['customer_id'], df_receipt[['customer_id', 'amount']], how='left', on='customer_id'). \
    groupby('customer_id').sum().reset_index()
# Join df_tmp_1 and df_tmp_2 (to compare 2019 sales with all-period sales (amount))
df_tmp = pd.merge(df_tmp_1, df_tmp_2, how='inner', on='customer_id')
# Fill each missing value with 0
df_tmp['amount_2019'] = df_tmp['amount_2019'].fillna(0)
df_tmp['amount'] = df_tmp['amount'].fillna(0)
# Compute the ratio (amount_rate) and fill its missing values with 0
df_tmp['amount_rate'] = df_tmp['amount_2019'] / df_tmp['amount']
df_tmp['amount_rate'] = df_tmp['amount_rate'].fillna(0)
# Display 10 records where the ratio (amount_rate) exceeds 0
df_tmp.query('amount_rate > 0').head(10)
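The problem also asks to confirm there is no NA/NaN in the result; a simple check, as in the earlier questions, is:
# Confirm that no missing values remain
df_tmp.isnull().sum()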
Reference: How to use pandas (merge, join, concat) Reference: Change row name / column name of pandas.DataFrame
P-085: For every customer in the customer data frame (df_customer), create a new df_customer_1 by attaching longitude and latitude from the geocoding data frame (df_geocode), keyed on postal code (postal_cd). When one postal code links to multiple rows, average the longitude (longitude) and latitude (latitude).
P-085
# Join the customer data frame (df_customer) with the longitude/latitude conversion data frame (df_geocode)
df_customer_1 = pd.merge(df_customer[['customer_id', 'postal_cd']],
                         df_geocode[['postal_cd', 'longitude', 'latitude']],
                         how='inner', on='postal_cd')
# The geocode data frame contains duplicate postal codes, so one customer can link to multiple rows
# Group by customer_id and average longitude and latitude respectively
# Rename the columns
df_customer_1 = df_customer_1.groupby('customer_id'). \
    agg({'longitude': 'mean', 'latitude': 'mean'}).reset_index(). \
    rename(columns={'longitude': 'm_longitude', 'latitude': 'm_latitude'})
# Join the customer data frame (df_customer) with the deduplicated result (df_customer_1)
df_customer_1 = pd.merge(df_customer, df_customer_1, how='inner', on='customer_id')
df_customer_1.head(3)
Reference: How to use pandas (merge, join, concat) Reference: Change row name / column name of pandas.DataFrame
P-086: Join the customer data frame with latitude/longitude created in the previous question (df_customer_1) with the store data frame (df_store), using the application store code (application_store_cd) as the key. Then compute the distance (km) from the store's latitude/longitude (longitude, latitude) and the customer's latitude/longitude, and display it together with the customer ID (customer_id), customer address (address), and store address (address). A simple approximation formula is fine, but a library implementing a more accurate method may also be used. Display only 10 results.
P-086
# Define the distance formula as a function
def calc_distance(x1, y1, x2, y2):
    x1_r = np.radians(x1)
    x2_r = np.radians(x2)
    y1_r = np.radians(y1)
    y2_r = np.radians(y2)
    return 6371 * np.arccos(np.sin(y1_r) * np.sin(y2_r)
                            + np.cos(y1_r) * np.cos(y2_r)
                            * np.cos(x1_r - x2_r))
# Join the customer data frame with latitude/longitude (df_customer_1) and the store data frame (df_store)
df_tmp = pd.merge(df_customer_1, df_store, how='inner',
                  left_on=['application_store_cd'], right_on=['store_cd'])
# Compute the distance
df_tmp['distance'] = calc_distance(df_tmp['m_longitude'], df_tmp['m_latitude'],
                                   df_tmp['longitude'], df_tmp['latitude'])
df_tmp[['customer_id', 'address_x', 'address_y', 'distance']].head(10)
# (Another solution) Pattern using the math module
import math

def calc_distance(x1, y1, x2, y2):
    distance = 6371 * math.acos(math.sin(math.radians(y1)) * math.sin(math.radians(y2))
                                + math.cos(math.radians(y1)) * math.cos(math.radians(y2))
                                * math.cos(math.radians(x1) - math.radians(x2)))
    return distance

df_tmp = pd.merge(df_customer_1, df_store, how='inner', left_on='application_store_cd', right_on='store_cd')
df_tmp['distance'] = df_tmp[['m_longitude', 'm_latitude', 'longitude', 'latitude']]. \
    apply(lambda x: calc_distance(x[0], x[1], x[2], x[3]), axis=1)
df_tmp[['customer_id', 'address_x', 'address_y', 'distance']].head(10)
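The problem statement also permits a high-accuracy library. As a hedged sketch (assuming geopy is installed; geopy.distance.geodesic takes (latitude, longitude) tuples and exposes the result in km):
from geopy.distance import geodesic

# Note: geodesic expects (latitude, longitude) order, unlike calc_distance above
df_tmp['distance'] = df_tmp[['m_latitude', 'm_longitude', 'latitude', 'longitude']]. \
    apply(lambda x: geodesic((x[0], x[1]), (x[2], x[3])).km, axis=1)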
P-087: In the customer data frame (df_customer), the same customer is registered multiple times due to applications at different stores. Treating customers with the same name (customer_name) and postal code (postal_cd) as the same customer, create a deduplicated (nayose) customer data frame (df_customer_u) with one record per customer. For duplicate customers, keep the record with the highest total sales amount; for customers with equal total sales or no sales record, keep the one with the smaller customer ID (customer_id).
P-087
# Group the receipt detail data frame (df_receipt) by customer and sum the sales amount
df_tmp = df_receipt.groupby('customer_id').agg({'amount': 'sum'}).reset_index()
# Join the customer data frame (df_customer) with the per-customer sales just created
# Left join so that every key in df_customer is kept (how='left')
# Sort amount in descending order and customer_id in ascending order
df_customer_u = pd.merge(df_customer, df_tmp, how='left', on='customer_id'). \
    sort_values(['amount', 'customer_id'], ascending=[False, True])
# Remove duplicates on name (customer_name) and postal code (postal_cd) so that each customer has one record
# (thanks to the sort, drop_duplicates keeps the first, i.e. preferred, row)
df_customer_u.drop_duplicates(subset=['customer_name', 'postal_cd'], keep='first', inplace=True)
# Output how many records were removed
print('Number of decrease: ', len(df_customer) - len(df_customer_u))
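The sort-then-dedupe trick is the crux here: drop_duplicates with keep='first' keeps whichever row the sort placed on top. A minimal sketch with hypothetical data:
import pandas as pd

df = pd.DataFrame({'name': ['A', 'A', 'B'],
                   'id': ['CS001', 'CS002', 'CS003'],
                   'amount': [100.0, 300.0, None]})
# Sort amount descending and id ascending (NaN amounts sort last), then keep the first row per name
kept = df.sort_values(['amount', 'id'], ascending=[False, True]) \
         .drop_duplicates(subset=['name'], keep='first')
print(kept['id'].tolist())  # ['CS002', 'CS003']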
Reference: Data concatenation / join processing as seen in Python pandas diagram Reference: sort_values, sort_index to sort pandas.DataFrame, Series Reference: Extract / delete duplicate rows of pandas.DataFrame, Series
P-088: Based on the data created in the previous question, create a data frame (df_customer_n) that adds an integrated deduplication ID to the customer data frame. The integrated ID shall be assigned according to the following specifications.
Non-duplicate customer: set the customer ID (customer_id). Duplicate customer: set the customer ID of the record extracted in the previous question.
P-088
# Join the customer data frame (df_customer) with the deduplicated customer data frame (df_customer_u)
# * Join on matching name (customer_name) and postal code (postal_cd)
df_customer_n = pd.merge(df_customer, df_customer_u[['customer_id', 'customer_name', 'postal_cd']],
                         how='inner', on=['customer_name', 'postal_cd'])
# Rename the columns
df_customer_n.rename(columns={'customer_id_x': 'customer_id', 'customer_id_y': 'integration_id'}, inplace=True)
# Output the difference in the number of IDs
print('Difference in the number of IDs: ', len(df_customer_n['customer_id'].unique()) - len(df_customer_n['integration_id'].unique()))
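To spot-check the assignment, one can look at rows where the integrated ID differs from the original ID; these are exactly the deduplicated customers:
# Rows where customer_id differs from integration_id were merged into another record
df_customer_n.query('customer_id != integration_id').head(3)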
Reference: Count the number and frequency (number of occurrences) of unique elements in pandas
P-089: For customers with sales records, you want to split the data into training data and test data in order to build a forecast model. Randomly split the data at a ratio of 8:2.
P-089
# Create customer data with a sales record (join df_customer and df_receipt)
# Deduplicate customer_id on the receipt side so each customer appears only once
df_tmp = pd.merge(df_customer, df_receipt['customer_id'].drop_duplicates(), how='inner', on='customer_id')
# Split df_tmp into training data and test data with train_test_split
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df_tmp, test_size=0.2, random_state=71)
# Compute the ratio of training data and test data
print('Training data ratio: ', len(df_train) / len(df_tmp))
print('Test data ratio: ', len(df_test) / len(df_tmp))
Reference: train_test_split that splits data for training and testing with scikit-learn
P-090: The receipt detail data frame (df_receipt) contains data from January 1, 2017 to October 31, 2019. Aggregate the sales amount (amount) monthly and create 3 sets of model-building data, each with 12 months for training and 6 months for testing.
P-090
# Copy the receipt detail data frame (df_receipt)
df_tmp = df_receipt.copy()
# Convert sales_ymd from int64 to str and slice out the year and month
df_tmp['sales_ymd'] = df_tmp['sales_ymd'].astype('str').str[:6]
# Group by sales_ymd (year/month) and sum the sales amount (amount)
df_tmp = df_tmp.groupby('sales_ymd').agg({'amount': 'sum'}).reset_index()
# Function to split into training data and test data
def split_date(df, train_size, test_size, slide_window, start_point):
    train_start = start_point * slide_window
    test_start = train_start + train_size
    return df[train_start:test_start], df[test_start:test_start + test_size]
# Create 3 sets of model-building data
df_train_1, df_test_1 = split_date(df_tmp, train_size=12, test_size=6, slide_window=6, start_point=0)
df_train_2, df_test_2 = split_date(df_tmp, train_size=12, test_size=6, slide_window=6, start_point=1)
df_train_3, df_test_3 = split_date(df_tmp, train_size=12, test_size=6, slide_window=6, start_point=2)
(Supplement: understanding the problem) There are 3 models, each with 12 months of data for training and 6 months for testing. Tracing the code above gives the following windows:
| Model | Training data | Training data range | Test data | Test data range |
|---|---|---|---|---|
| Model 1 | df_train_1 | 2017/01-2017/12 | df_test_1 | 2018/01-2018/06 |
| Model 2 | df_train_2 | 2017/07-2018/06 | df_test_2 | 2018/07-2018/12 |
| Model 3 | df_train_3 | 2018/01-2018/12 | df_test_3 | 2019/01-2019/06 |
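As a quick sanity check of the window logic, printing the first and last month of each split should reproduce the table above:
# Each training set should span 12 months and each test set the following 6 months
for tr, te in [(df_train_1, df_test_1), (df_train_2, df_test_2), (df_train_3, df_test_3)]:
    print(tr['sales_ymd'].min(), '-', tr['sales_ymd'].max(), '|',
          te['sales_ymd'].min(), '-', te['sales_ymd'].max())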
Reference: train_test_split that splits data for training and testing with scikit-learn
P-091: For the customer data frame (df_customer), extract data by undersampling so that the number of customers with a sales record and the number without one are 1:1.
P-091
# Group the receipt detail data frame (df_receipt) by customer and sum the sales amount
df_tmp = df_receipt.groupby('customer_id').agg({'amount': 'sum'}).reset_index()
# Left join the customer data frame (df_customer) with the per-customer sales (df_tmp)
df_tmp = pd.merge(df_customer, df_tmp, how='left', on='customer_id')
# Flag 1 if there is a sales record (amount), 0 if not
df_tmp['buy_flg'] = df_tmp['amount'].apply(lambda x: 0 if np.isnan(x) else 1)
# Output the number of customers with and without a sales record
print('Number of 0: ', len(df_tmp.query('buy_flg == 0')))
print('Number of 1: ', len(df_tmp.query('buy_flg == 1')))
# Undersample with RandomUnderSampler
from imblearn.under_sampling import RandomUnderSampler
rs = RandomUnderSampler(random_state=71)
df_sample, _ = rs.fit_sample(df_tmp, df_tmp.buy_flg)
# Output the numbers again after undersampling
print('Number of 0: ', len(df_sample.query('buy_flg == 0')))
print('Number of 1: ', len(df_sample.query('buy_flg == 1')))
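Note: in recent versions of imbalanced-learn, fit_sample was deprecated and then removed in favor of fit_resample; if the call above fails, use:
# fit_resample is the current imbalanced-learn API
df_sample, _ = rs.fit_resample(df_tmp, df_tmp.buy_flg)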
Reference: [Kaggle] I tried undersampling using imbalanced-learn Reference: downsampling of unbalanced data of multi-class classification
P-092: In the customer data frame (df_customer), gender information is kept in a denormalized state. Normalize it to third normal form.
P-092
# Extract gender_cd and gender from the customer data frame (df_customer) and remove duplicates
df_gender = df_customer[['gender_cd', 'gender']].drop_duplicates()
# The gender data frame (df_gender) now holds the mapping, so drop the gender column from the customer data frame
df_customer_s = df_customer.drop(columns='gender')
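To confirm the normalization is lossless, an optional check is to join the two tables back together on gender_cd and compare the size with the original:
# Joining back should reproduce the original denormalized view
df_check = pd.merge(df_customer_s, df_gender, how='inner', on='gender_cd')
print(len(df_check) == len(df_customer))  # True if nothing was lost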
Reference: Extract / delete duplicate rows of pandas.DataFrame, Series
P-093: The product data frame (df_product) holds only the code value of each category, not the category name. Create a new product data frame with the category name by denormalizing it in combination with the category data frame (df_category).
P-093
# Join the product data frame (df_product) with the category data frame (df_category)
df_product_full = pd.merge(df_product, df_category[['category_small_cd', 'category_major_name',
                                                    'category_medium_name', 'category_small_name']],
                           how='inner', on='category_small_cd')
df_product_full.head()
The join key is set to "category_small_cd" because "category_major_cd" and "category_medium_cd" are coarser categories and cannot identify rows at the detailed level.
P-094: Output the product data with category names created earlier to a file with the following specifications. The output destination path is under data. File format: CSV (comma-separated values); with header; character encoding: UTF-8.
P-094
# Output using to_csv (file format is CSV)
df_product_full.to_csv('./data/P_df_product_full_UTF-8_header.csv', encoding='UTF-8', index=False)
# (Another solution)
# Code example 2 (with BOM, to prevent garbled characters when opening in Excel)
df_product_full.to_csv('./data/P_df_product_full_UTF-8_header.csv',
                       encoding='utf_8_sig',
                       index=False)
Reference: Export / add csv file with pandas (to_csv)
P-095: Output the product data with category names created earlier to a file with the following specifications. The output destination path is under data. File format: CSV (comma-separated values); with header; character encoding: CP932.
P-095
# Output using to_csv (file format is CSV)
df_product_full.to_csv('./data/P_df_product_full_CP932_header.csv', encoding='CP932', index=False)
Reference: Export / add csv file with pandas (to_csv)
P-096: Output the product data with category names created earlier to a file with the following specifications. The output destination path is under data. File format: CSV (comma-separated values); no header; character encoding: UTF-8.
P-096
# Output using to_csv (file format is CSV, no header)
df_product_full.to_csv('./data/P_df_product_full_UTF-8_noh.csv', header=False, encoding='UTF-8', index=False)
Reference: Export / add csv file with pandas (to_csv)
P-097: Read the file created earlier in the following format and create a data frame. Also display the first 10 rows and check that the data was read correctly. File format: CSV (comma-separated values); with header; character encoding: UTF-8.
P-097
# Read using read_csv
df_tmp = pd.read_csv('./data/P_df_product_full_UTF-8_header.csv')
df_tmp.head(10)
Reference: Read csv / tsv file with pandas (read_csv, read_table)
P-098: Read the file created earlier in the following format and create a data frame. Also display the first 10 rows and check that the data was read correctly. File format: CSV (comma-separated values); no header; character encoding: UTF-8.
P-098
# Read using read_csv (header=None because the file has no header row)
df_tmp = pd.read_csv('./data/P_df_product_full_UTF-8_noh.csv', header=None)
df_tmp.head(10)
Reference: Read csv / tsv file with pandas (read_csv, read_table)
P-099: Output the product data with category names created earlier to a file with the following specifications. The output destination path is under data. File format: TSV (tab-separated values); with header; character encoding: UTF-8.
P-099
# Output using to_csv (file format is TSV)
# Save as a TSV file separated by the tab character \t
df_product_full.to_csv('./data/P_df_product_full_UTF-8_header.tsv', sep='\t', encoding='UTF-8', index=False)
Reference: Export / add csv file with pandas (to_csv)
P-100: Read the file created earlier in the following format and create a data frame. Also display the first 10 rows and check that the data was read correctly. File format: TSV (tab-separated values); with header; character encoding: UTF-8.
P-100
# Read the TSV using read_table
df_tmp = pd.read_table('./data/P_df_product_full_UTF-8_header.tsv', encoding='utf-8')
df_tmp.head(10)
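As an alternative, read_csv with sep='\t' reads the same file (read_table is essentially read_csv with a tab separator):
# Equivalent read using read_csv with a tab separator
df_tmp = pd.read_csv('./data/P_df_product_full_UTF-8_header.tsv', sep='\t', encoding='utf-8')
df_tmp.head(10)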
Reference: Read csv / tsv file with pandas (read_csv, read_table)
Related: 100 knocks of data science / How to run 100 data science knocks on Mac
The problems from P-081 to P-093 were quite challenging. Some of the explanations are still shallow, so I will improve them as I review.