A personal note that I update from time to time as I learn about "preprocessing" data using pandas.
--When you hear "preprocessing" you might think of AI / machine learning, but my purpose in learning preprocessing is to be able to process and analyze accumulated data the way I want. It may therefore be insufficient as preprocessing information for machine learning purposes.
--I am not in the AI field, so if you are looking for knowledge in that area, please treat this article as reference material only.
--Since this is still being written, there are many ambiguous expressions, and some parts may be hard to follow. I plan to update it from time to time, and I would appreciate it if you could point out any incorrect information.
Preprocessing means cleaning and transforming accumulated data so that it can be used for the intended work (machine learning, etc.). It is said that machine learning practitioners spend most of their time on this work, together with data collection.
--Data cleansing: synonymous with data cleaning. Probably refers to part of the work in "preprocessing".
--Data cleaning: synonymous with data cleansing.
--Data wrangling: broader than the two above. Perhaps "synonymous with preprocessing", or it "includes preprocessing".
Roughly, the work proceeds as follows. There may be a more efficient order, and each step breaks down into more detailed tasks.
Pre-analysis
--Check the amount of data, its current state, and so on
Cleaning
--Rename columns
--Check for inconsistent notation
--Check for duplicate rows
--Check for missing values
--Delete unnecessary columns
Processing / conversion / restructuring
--Add columns
--Merge with other data
--Add features
Prepare a Python test environment for learning.
--Internet environment
--Google Colaboratory is free and excellent.
--Even for a heavy process, if you write the code so that the result is dumped to Google Drive, you can check the dump file once the process finishes, even if you close the browser or the PC in the meantime. A minimal sketch of this idea appears after this list.
--See the past article for how to link Google Drive and Colab (Mount and operate Google Drive with Google colabo)
--Local environment
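A minimal sketch of the dump-to-Drive idea mentioned above, assuming a Google Colaboratory runtime (the file name result_dump.csv and the DataFrame df are placeholder examples):
#Mount Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive')
#... long-running processing that produces a DataFrame df ...
#Dump the result to Drive so it can be checked later, even after closing the browser or PC
df.to_csv('/content/drive/My Drive/result_dump.csv', index=False)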
It is handy to be able to quickly prepare test data as both a Series and a DataFrame; a minimal sketch is included below until these sections are filled in.
Series
in preparation
DataFrame
in preparation
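Here is a minimal sketch of throwaway test data (the column names and values are arbitrary examples):
import pandas as pd
import numpy as np
#Test Series
s = pd.Series([10, 20, 30, np.nan], name="test")
#Test DataFrame with string columns and a missing value
df = pd.DataFrame({
    "Product name": ["AA Limited", "BB new", "CC"],
    "unit price": [100, 200, np.nan],
})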
#Check max, min, mean, standard deviation, and other summary statistics
df.describe()
#Check the data type of each column
df.info()
#Browse the first few lines of data
df.head()
#Browse the last few lines of data
df.tail()
#Number of dimensions of data (how many rows and columns)
df.shape
#Count occurrences of each value
df["column"].value_counts()
#Count occurrences of each value (same as above)
pd.value_counts(df["column"])
#List of column names ①
df.columns
#Non-null count per column (the column names appear as the index)
df.count()
#Missing value count ("Number of lines-The number of data"When requesting at)
#You can list the missing status of the entire column.
df_s.shape[0] - df_s.count()
#Missing value count (specific column)
df_s["Defective content.1"].isnull().value_counts()
#Show rows with missing values (in a specific column)
df_s.loc[df_s["Defective content.1"].isnull()]
#Delete a row if there is even one missing value in df
df.dropna(axis = 0, how = 'any', inplace=True)
#Drop rows using a threshold (thresh=3 keeps only rows with at least 3 non-missing values)
#Without inplace=True (or reassigning the result), the change is not reflected in the source data.
df.dropna(thresh=3, inplace=True)
#fillna: fill in missing values (here written with a .loc assignment)
df.loc[df["unit price"].isna(), "unit price"] = 0
#With skipna=False, the result becomes NaN if any missing value is included in the calculation (the default skipna=True ignores missing values).
df.sum(skipna=False)
df.mean(skipna=False)
#Check for duplicated rows (True means the row duplicates an earlier one)
df_s.duplicated().value_counts()
#Count the duplicated rows
df_s[df_s.duplicated() == True].count()
#Count duplicates for a specific column only
df_s[df_s.duplicated() == True ]["parts"].count()
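To actually remove the duplicates found above, drop_duplicates can be used (a minimal sketch reusing the same column name):
#Drop fully duplicated rows (the first occurrence is kept by default)
df_s = df_s.drop_duplicates()
#Judge duplicates on a specific column only
df_s = df_s.drop_duplicates(subset="parts")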
Check that each column's data type is correct, and convert it to the appropriate type if necessary. When converting to a numeric type, remember the preparatory work that may be needed first, such as turning strings into numbers and setting null data to 0. The `astype` method is mainly used.
#Data type reference
df2.info()
#Data type conversion (realized by adding a new column)
df["Part number 2"] = df["Product number"].astype(object)
#Data type conversion (assign back into the original column and convert)
df_s["test"] = df_s["test"].astype(object) # object(string)
df_s["test"] = df_s["test"].astype(str) # object(string)
df_s["test"] = df_s["test"].astype(float) # Float
df_s["test"] = df_s["test"].astype(int) # integer
#Convert to numeric type, forcibly turning values that cannot be parsed into NaN
df_s["test"] = pd.to_numeric(df_s["test"] , errors="coerce")
#Extracting a specific column (does not change the original data)
df[["colB","colD"]]
#Extract using a regular expression
tanka_nan = df["unit price"].str.extract(r"(^\D*)", expand=False)
tanka_nan.value_counts()
#Row extraction (rows whose unit price column value is "Bulk product") combined with column extraction of the unit price column
df.loc[df["unit price"] == "Bulk product","unit price"]
#Filter rows whose value is contained in a list (here mylist) and create a new DF
df2 = df[df["Product number"].isin(mylist)]
dftest["grade"] = dftest["Product name"].apply(lambda x : x[0:2] )
#map: convert the values of a column using a dictionary
dftest = pd.DataFrame({"prefecture":["hokkaidou","shizuoka","okinawa"],"city":["sapporo","shizuoka","naha"]})
dftest_map = {"hokkaidou":10,"shizuoka":20,"okinawa":30}
dftest["maptest"] = dftest.prefecture.map(dftest_map)
dftest
# Create the following function:
# If the argument value contains "Limited", return the string "Limited product".
# If the argument value contains "new", return the string "new product".
# Otherwise, return the string "na".
def newitem(x):
    if "Limited" in str(x):
        return "Limited product"
    elif "new" in str(x):
        return "new product"
    else:
        return "na"
# Create a new column "Product category" by applying the created function to the product name column.
df["Product category"] = df["Product name"].apply(newitem)
#Rename a column (specify the old and new names in dictionary format)
df = df.rename(columns={'Product name':'Product name'})
#Rename column (change the very first column)
df.rename(columns={df.columns[0]: "Product name"})
#Set the entire unit price column to 0
df.loc[:,"unit price"] = 0
#Set the unit price to 0 only for rows where it is "Supply"
df.loc[df["unit price"] == "Supply", "unit price"] = 0
df["grade"] = df["Product name"].str.extract(
"(^[A-Z]{2}|^[A-Z]{2}|^/d/d?//d+|^[0-9][0-9]?/[0-9]*)", expand=False)
Grouping uses a GroupBy object.
--Use df.groupby().
--If you assign the result of groupby to a variable, that variable holds a GroupBy object (not a DataFrame or Series).
#Extract the rows that belong to one group
df.groupby('items').get_group('item_1')
#Assigning the result gives a GroupBy object; .groups shows the group labels and their row indexes
g1 = df.groupby('items')
print(g1.groups)
#Number of rows in each group
df.groupby('items').size()
#Aggregate all columns at once (np.sum requires numpy)
import numpy as np
df_s.groupby("Bagging location").agg(np.sum)
#Aggregation using a dictionary
df_s.groupby("Bagging location").agg({"Total production number":np.sum, "lot":"first" })
train["year"]= train["datetime"].apply(lambda x : x.split("-")[0] )
The main operations are the following. (Still being written.)
--melt (the word means "to melt"; reshapes from wide to long format). A minimal sketch appears after this list.
in preparation
in preparation
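Until that section is written, here is a minimal sketch of melt (the column names are arbitrary examples):
import pandas as pd
df_wide = pd.DataFrame({
    "Product name": ["A", "B"],
    "Jan": [10, 30],
    "Feb": [20, 40],
})
#Wide to long: one row per (Product name, month) pair
df_long = pd.melt(df_wide, id_vars="Product name",
                  var_name="month", value_name="quantity")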
Dividing continuous values into a small number of groups using bins. Also called binning (bin splitting). A "bin" apparently means something like a shelf or container.
For example, if there is a column containing people's ages, this can be used to group them by age bracket, such as teens, 20s, 30s, and so on.
Use cut and qcut.
The image is one of sorting the original data (age data in this case) into the prepared "shelves". Naturally, the number of bins is smaller than the number of source data points.
#First, suppose we have age data like this.
import random
ages = [random.randint(0, 100) for i in range(20)]
print(ages)
#[84, 73, 27, 85, 8, 17, 46, 16, 95, 62, 38, 47, 63, 44, 69, 26, 0, 40, 34, 79]
#Prepare bins (bin edges) to group this age data
import numpy as np
bins = np.arange(0,90,10) #Generate the edges 0, 10, ..., 80
print(bins)
# >>> [ 0 10 20 30 40 50 60 70 80]
#Cut
cats = pd.cut(ages, bins)
print(cats)
# [NaN, (70.0, 80.0], (20.0, 30.0], NaN, (0.0, 10.0], ..., (20.0, 30.0], NaN, (30.0, 40.0], (30.0, 40.0], (70.0, 80.0]]
# Length: 20
# Categories (8, interval[int64]): [(0, 10] < (10, 20] < (20, 30] < (30, 40] < (40, 50] < (50, 60] < (60, 70] < (70, 80]]
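qcut, also mentioned above, splits by quantiles instead of fixed edges, so each bin ends up with roughly the same number of elements (a minimal sketch reusing the ages data):
#Split the ages into 4 quantile-based bins
cats_q = pd.qcut(ages, 4)
print(cats_q.value_counts())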
Features are the columns that are fed directly into a machine learning prediction model.
When creating a predictive model, not all columns of the collected data are needed; conversely, there may not be enough columns.
In machine learning it is important to select these columns, or to add new columns to increase the number of features; adding feature columns is called feature creation. Creating and tuning features is called feature engineering.
--Books (links to Amazon)
--[O'Reilly: Introduction to Data Analysis with Python, 2nd Edition (Data Processing Using NumPy and pandas), Wes McKinney](https://www.amazon.co.jp/dp/487311845X)
--[Basic coding for Python data analysis / machine learning! Introduction to pandas library utilization (impress top gear), Daniel Y. Chen](https://www.amazon.co.jp/dp/4295005657)
--Video (Udemy): [180,000 people worldwide] Practical Python Data Science