This article is from Furukawa Lab Advent_calendar Day 18.

This article was written by a student at Furukawa Lab as part of his studies. Some explanations may be imprecise or worded unusually.

Introduction

While preprocessing data with pandas, I ran into trouble because the data contained elements other than numerical values, so I am summarizing what I learned in this article. The code in this article was run in Jupyter Notebook.

Data to handle

Kaggle's 2,2k+ Scotch Whisky Reviews dataset

Data description

This dataset contains reviewers' evaluations of Scotch whisky. It has 2247 rows and 7 columns.

Data confirmation

# Import libraries
import pandas as pd
import numpy as np
# Read the csv file
data = pd.read_csv('scotch_review.csv')
# Display the first rows
data.head()

Main subject

Check if the data to be handled has elements with types other than numerical values

This time we will only use the 'review.point' and 'price' columns. Let's look at the data type of each column.

#Type confirmation
data[['review.point','price']].dtypes

It seems that non-numeric elements are mixed into the 'price' column. The following code can be used to determine whether a column contains non-numeric elements. (Numbers stored as strings can be converted, so they count as numeric. Values that are already missing count as non-numeric.)

# Returns False for each of 'review.point' and 'price' that contains elements which cannot be converted to a numeric type
data[['review.point', 'price']].apply(lambda s: pd.to_numeric(s, errors='coerce')).notnull().all()
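To make the behaviour of this check concrete, here is a minimal sketch on a small hand-made DataFrame. The column names and values are hypothetical and are not taken from the whisky dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the whisky dataset.
toy = pd.DataFrame({
    "all_numbers": [90, 85, 88],            # plain integers
    "numeric_strings": ["10", "2.5", "7"],  # strings that parse as numbers
    "mixed": ["100", "1,500", "60/set"],    # some strings do not parse
    "has_nan": [1.0, np.nan, 3.0],          # numeric, but with a missing value
})

# Same check as above: coerce each column, then test for missing values.
is_clean = toy.apply(lambda s: pd.to_numeric(s, errors="coerce")).notnull().all()
print(is_clean)
```

In this sketch only "all_numbers" and "numeric_strings" come out as True. "has_nan" comes out as False even though it is numeric, so a False result means "contains something that is NaN after conversion", not strictly "contains a non-numeric string".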

Extraction and conversion of elements with non-numeric types

From here, we will extract the non-numeric elements from the 'price' column and replace them. First comes the extraction.

# Extract the rows whose 'price' cannot be converted to a numeric type
pic = data[['price']][data['price'].apply(lambda s: pd.to_numeric(s, errors='coerce')).isnull()]
pic
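As an alternative sketch, the same kind of extraction can be written by converting the whole Series in one call instead of element by element. The toy values below are hypothetical, and the extra `notnull()` condition is my own addition so that values that were already missing are not reported as non-numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices, not taken from the whisky dataset.
prices = pd.Series(["100", "1,500", "60/set", np.nan, "45.5"], name="price")

# Convert the whole Series at once; unparseable entries become NaN.
converted = pd.to_numeric(prices, errors="coerce")

# Keep entries that failed to convert but were not missing to begin with.
non_numeric = prices[converted.isnull() & prices.notnull()]
print(non_numeric)
```

Here only "1,500" and "60/set" are extracted. The row that was NaN from the start is left out.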

Here, prices given per set or per liter (those containing '/set' or '/liter') are treated as missing values, and the others are prepared for conversion to a numeric type by removing the thousands separators.

# Delete ',' and replace elements containing '/' with a missing value
change_data = pic['price'].str.replace(',', '').mask(pic['price'].str.contains('/'), np.nan)
change_data
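The cleaning step can also be wrapped in a small function. The sketch below is only an illustration: `clean_price` is a hypothetical helper name, the toy values are made up, and the `regex=False` and `na=False` arguments are my own additions so that ',' and '/' are treated as literal characters and missing entries are handled as "does not contain '/'".

```python
import numpy as np
import pandas as pd

def clean_price(prices: pd.Series) -> pd.Series:
    """Hypothetical helper: strip commas and blank out per-unit prices."""
    # regex=False treats ',' as a literal character, not a pattern.
    without_commas = prices.str.replace(",", "", regex=False)
    # na=False makes missing entries count as "does not contain '/'".
    per_unit = prices.str.contains("/", regex=False, na=False)
    return without_commas.mask(per_unit, np.nan)

# Hypothetical toy prices, not taken from the whisky dataset.
toy_prices = pd.Series(["1,500", "60/set", "15,000/liter", np.nan])
cleaned = clean_price(toy_prices)
print(cleaned)
```

Only the first element survives as the string "1500". The two per-unit prices and the entry that was already missing all end up as NaN.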

Reflect the changes in the original data.

#Make a copy of the original data and replace the relevant part
data_c = data.copy()
data_c.loc[pic.index,'price'] = change_data

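Finally, convert the values in the 'price' column to a numeric type and delete the rows containing missing values.

# errors='coerce' turns anything still unparseable into NaN, so the column always ends up numeric
data_c['price'] = pd.to_numeric(data_c['price'], errors='coerce')
# Drop the rows that contain missing values
df = data_c.dropna()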
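The final conversion can be checked on a small stand-in for data_c. This is a sketch under assumptions: the toy frame is hypothetical, and `dropna(subset=['price'])` is shown as one possible variation that drops rows only when 'price' is missing, whereas a bare `dropna()` drops a row if any column is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for data_c after the replacement step.
toy_c = pd.DataFrame({
    "review.point": [90, 85, 88, 92],
    "price": ["100", "1500", np.nan, "45.5"],
})

# errors="coerce" guarantees a numeric dtype; leftovers become NaN.
toy_c["price"] = pd.to_numeric(toy_c["price"], errors="coerce")

# Drop rows only when 'price' is missing (an alternative to a bare dropna()).
toy_df = toy_c.dropna(subset=["price"])

print(toy_df.dtypes)
print(len(toy_df))
```

Checking `dtypes` after the conversion is a quick way to confirm that 'price' really became a float column rather than staying as object.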

Conclusion

This time, I extracted the non-numeric elements of a pandas.DataFrame and replaced them. In the next article, I'll visualize this preprocessed whisky reviews dataset.

Appendix

Here is a description of the functions used in the code above.

Determining if there are elements that cannot be converted to numeric type

# Returns False for each of 'review.point' and 'price' that contains elements which cannot be converted to a numeric type
data[['review.point', 'price']].apply(lambda s:pd.to_numeric(s, errors='coerce')).notnull().all()

pd.to_numeric(arg, errors='coerce') -- Converts each element of arg (here a pd.Series) to a numeric type. Elements that cannot be converted are handled according to the value passed to errors. Since it is 'coerce' this time, they are replaced with the missing value NaN. (The default is 'raise'.)

function_name = lambda arguments: expression -- An anonymous function. It works the same as the function below.

def function_name(arguments):
    return expression

DataFrame.apply(function, axis=0) -- Passes each column (axis=0) or each row (axis=1) of the DataFrame to the function as a Series.

DataFrame.notnull() -- Returns True for each element of the pandas.DataFrame that is not a missing value and False for each one that is.

DataFrame.all(axis=0) -- Returns True for each column (axis=0) or each row (axis=1) whose elements are all True.
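The pieces described above can be tried together on a tiny example. This is a minimal sketch with hypothetical names (`to_num`, `to_num_def`) and made-up data. It shows that the lambda and the named function give the same result, and how `all()` behaves per column and per row.

```python
import pandas as pd

# A lambda and the equivalent named function.
to_num = lambda s: pd.to_numeric(s, errors="coerce")

def to_num_def(s):
    return pd.to_numeric(s, errors="coerce")

# Hypothetical toy data, not taken from the whisky dataset.
toy = pd.DataFrame({"a": ["1", "2"], "b": ["3", "x"]})

# axis=0 (default): the function receives each column as a Series.
by_lambda = toy.apply(to_num)
by_def = toy.apply(to_num_def)

not_missing = by_lambda.notnull()      # element-wise True/False
per_column = not_missing.all()         # one result per column
per_row = not_missing.all(axis=1)      # one result per row
print(per_column)
print(per_row)
```

Column "b" and the second row come out as False because the string "x" cannot be converted.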

Partial replacement of the string of each element and replacement of the value itself

# Delete ',' and replace elements containing '/' with a missing value
change_data = pic['price'].str.replace(',','').mask(pic['price'].str.contains('/'), np.nan) 
change_data

Series.str.replace('string A', 'string B') -- Replaces 'string A' contained in each element of the Series with 'string B'.

Series.str.contains('string A') -- Returns True for elements that contain 'string A' and False for the others.

Series.mask(cond, value) -- Replaces the elements where cond (a boolean Series) is True with value and leaves the elements where it is False unchanged.
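To see what each of these three methods returns on its own, the sketch below prints the intermediate results for a made-up Series. The replacement marker "N/A" is used here instead of NaN purely so that the effect of `mask` is easy to see. It is an assumption for illustration, not what the article's code does.

```python
import pandas as pd

# Hypothetical toy prices, not taken from the whisky dataset.
s = pd.Series(["1,500", "60/set", "200"])

replaced = s.str.replace(",", "", regex=False)   # '1,500' becomes '1500'
has_slash = s.str.contains("/", regex=False)     # True only for '60/set'
masked = replaced.mask(has_slash, "N/A")         # swap in a visible marker

print(replaced.tolist())
print(has_slash.tolist())
print(masked.tolist())
```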

Extract non-numeric elements with pandas.DataFrame