Extract non-numeric elements with pandas.DataFrame

This article is from Furukawa Lab Advent_calendar Day 18.

This article was written by a student at Furukawa Lab as part of his studies. The content may be ambiguous or the expression may be slightly different.

Introduction

When I was pre-processing with pandas, I felt awkward because the data contained elements other than numerical values, so I will summarize it as an article. Also, the code in this article uses jupyter notebook.

Data to handle

kaggle's 2,2k + Scotch Whiskey Reviews Dataset

Data description

This is a dataset evaluated by reviewers about Scotch whiskey. The number of data is 2247 and the number of items is 7.

Data confirmation

#Library import
import pandas as pd
import numpy as np
#Read csv file
data = pd.read_csv('scotch_review.csv')
#Data display
data.head()

Main subject

Check if the data to be handled has elements with types other than numerical values

This time we will only use the items'review.point'and'price'. Let's look at the data type of each column

#Type confirmation
data[['review.point','price']].dtypes

kata.jpg

It seems that non-numeric elements are mixed in the'review.point' column. ** The following code can be used to determine if there are non-numeric elements in the corresponding column. (* Str type number returns True) **

#'price', 'review.point'If the column has elements that cannot be converted to numeric type'False'return it.
data[['review.point', 'price']].apply(lambda s:pd.to_numeric(s, errors='coerce')).notnull().all()

kata.jpg

Extraction and conversion of elements with non-numeric types

From here, we will extract elements with types other than numbers from the'price'column and replace them. First is extraction.

#Extraction of non-numeric type elements
pic = data[['price']][data['price'].apply(lambda s:pd.to_numeric(s, errors='coerce')).isnull()]
pic

!

Here, the data of / set and / liter are treated as missing values, and the others are converted to numeric type.

# ','Delete,'/'Replace the element containing with a missing value
change_data = pic['price'].str.replace(',','').mask(pic['price'].str.contains('/'), np.nan) 
change_data

str_data.jpg

Reflect the changes in the original data.

#Make a copy of the original data and replace the relevant part
data_c = data.copy()
data_c.loc[pic.index,'price'] = change_data

Finally, convert the number in the'price' column to a numeric type and delete the row containing the missing value.

data_c['price'] = pd.to_numeric(data_c['price'], errors = 'ignore')
df  = data_c.dropna()

At the end

This time, I extracted and replaced elements other than numerical values with pandas.DataFrame. In the next article, I'll visualize this preprocessed Whiskey Reviews dataset.

appendix

Here is a description of the function I was using in my code.

Determining if there are elements that cannot be converted to numeric type

#'price', 'review.point'If the column has elements that cannot be converted to numeric type'False'return it.
data[['review.point', 'price']].apply(lambda s:pd.to_numeric(s, errors='coerce')).notnull().all()
def function name(argument):
return expression

--DataFrame.apply (function, axis = 0) --Pass the DataFrame element as a function argument. Select the passing method (row direction or column direction) with axis.

Partial replacement of the string of each element and replacement of the value itself

# ','Delete,'/'Replace the element containing with a missing value
change_data = pic['price'].str.replace(',','').mask(pic['price'].str.contains('/'), np.nan) 
change_data

--Series.str.replace ('String A','String B') --Convert'string A'contained in each element of Series to'string B' --Series.str.contains ('string A') --Returns elements containing'string A'as True and other elements as False. --Series.mask (arg,'value') --Replace the True element of arg (Series) with a'value'and do not change the'False' element.

Recommended Posts

Extract non-numeric elements with pandas.DataFrame
Extract multiple elements with Numpy array
Extract EXIF with sips
Extract elements other than a specific index with Numpy
Extract Twitter data with CSV
Extract numbers with regular expressions
Extract array elements and indexes in descending order with numpy
Extract peak values with scipy
Extract the xz file with python
Extract the maximum value with pandas.
[Automation] Extract Outlook appointments with Python
[Beginner] Extract character strings with Python
Extract specific multiple columns with pandas