This article is from Furukawa Lab Advent_calendar Day 18.
This article was written by a student at Furukawa Lab as part of his studies. The content may be ambiguous or the expression may be slightly different.
When I was pre-processing with pandas, I felt awkward because the data contained elements other than numerical values, so I will summarize it as an article. Also, the code in this article uses jupyter notebook.
kaggle's 2,2k + Scotch Whiskey Reviews Dataset
This is a dataset evaluated by reviewers about Scotch whiskey. The number of data is 2247 and the number of items is 7.
#Library import
import pandas as pd
import numpy as np
#Read csv file
data = pd.read_csv('scotch_review.csv')
#Data display
data.head()
This time we will only use the items'review.point'and'price'. Let's look at the data type of each column
#Type confirmation
data[['review.point','price']].dtypes
It seems that non-numeric elements are mixed in the'review.point' column. ** The following code can be used to determine if there are non-numeric elements in the corresponding column. (* Str type number returns True) **
#'price', 'review.point'If the column has elements that cannot be converted to numeric type'False'return it.
data[['review.point', 'price']].apply(lambda s:pd.to_numeric(s, errors='coerce')).notnull().all()
From here, we will extract elements with types other than numbers from the'price'column and replace them. First is extraction.
#Extraction of non-numeric type elements
pic = data[['price']][data['price'].apply(lambda s:pd.to_numeric(s, errors='coerce')).isnull()]
pic
!
Here, the data of / set and / liter are treated as missing values, and the others are converted to numeric type.
# ','Delete,'/'Replace the element containing with a missing value
change_data = pic['price'].str.replace(',','').mask(pic['price'].str.contains('/'), np.nan)
change_data
Reflect the changes in the original data.
#Make a copy of the original data and replace the relevant part
data_c = data.copy()
data_c.loc[pic.index,'price'] = change_data
Finally, convert the number in the'price' column to a numeric type and delete the row containing the missing value.
data_c['price'] = pd.to_numeric(data_c['price'], errors = 'ignore')
df = data_c.dropna()
This time, I extracted and replaced elements other than numerical values with pandas.DataFrame. In the next article, I'll visualize this preprocessed Whiskey Reviews dataset.
Here is a description of the function I was using in my code.
#'price', 'review.point'If the column has elements that cannot be converted to numeric type'False'return it.
data[['review.point', 'price']].apply(lambda s:pd.to_numeric(s, errors='coerce')).notnull().all()
def function name(argument):
return expression
--DataFrame.apply (function, axis = 0) --Pass the DataFrame element as a function argument. Select the passing method (row direction or column direction) with axis.
# ','Delete,'/'Replace the element containing with a missing value
change_data = pic['price'].str.replace(',','').mask(pic['price'].str.contains('/'), np.nan)
change_data
--Series.str.replace ('String A','String B') --Convert'string A'contained in each element of Series to'string B' --Series.str.contains ('string A') --Returns elements containing'string A'as True and other elements as False. --Series.mask (arg,'value') --Replace the True element of arg (Series) with a'value'and do not change the'False' element.
Recommended Posts