Aidemy　2020/10/29

Introduction

Hello, it is Yope! I am a liberal arts student, but I was interested in the possibilities of AI, so I went to the AI-specialized school "Aidemy" to study. I would like to share the knowledge gained here with you, and I am summarizing it on Qiita. I am very happy that many people have read the previous summary article. Thank you! This is the second post of machine learning pre-processing. Nice to meet you.

This article is a summary of what you learned in "Aidemy" "in your own words". It may contain mistakes and misunderstandings. Please note.

What to learn this time ・ Data acquisition from Excel ・ Data acquisition from database

Data acquisition from Excel

Data read

-The first thing to do in data preprocessing is __ "Read data" __. -Data sources that read data include __ "files", "databases", "websites" __, etc., and this time we will deal with __Excel files and data acquisition from databases, which are often read. ..

-Read data from Excel with pandas. -As you learned in "Data Handling 2", use __pd.read_excel () __ to read the Excel file. -"File link" and "Sheet name in Excel" can be specified in this argument. -Also, you can pass the individually set __pd.ExcelFile ("file name") __ as the first argument.

Combine Excel data

-__ When you want to handle the data organized in multiple sheets as one data __, you need to perform data combination as preprocessing. -As an example, it is conceivable to combine the product data sheet A and the sales data sheet B in the horizontal direction. -Use __pd.merge (left, right, on =) __ for this merge. -Arguments "left" and "right" specify the sheet to be joined on the left / right side, and "on" specifies the name of the column used for joining.

・ Code![Screenshot 2020-10-29 14.43.14.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/e487d03e-75a2-0ef9- 8df3-f855081f1d55.png)

・ Result![Screenshot 2020-10-29 14.43.44.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/963997fe-836e-c819- f87a-563a13a7143a.png)

Filtering Excel data

-If you want to narrow down the data (row) that has a specific character string from the read or combined Excel data (sheet), use __DataFlame.query ('column with specified character string> character string') __ To do. -If you want to specify more than one of this character string, pass it in a list, and change the ">" part of the above code to "==" or "in".

-For example, if you want to extract lines with product_ids 1 and 3 of the data "df" read by pandas, do as follows.

df.query('product_id == [1,3]')

Data aggregation (grouping)

-There is a column called "type" in the loaded product sheet "df", where the data belongs to one of "food" representing food, "drink" representing beverages, and "sweet" representing sweets. And. -Of these, when you want to know the number of __ data belonging to "food" __ etc., it is necessary to aggregate and group the data belonging to "food". -Use __df.groupby ('name of column to group') __ for grouping. -At this time, when grouping with multiple columns, use __ ('column 1','column 2']) __. -The form of the data returned by this groupby method is "GroupBy object", and it is grouped by using methods such as __count (), mean (), sum (), max () __. It is possible to obtain the number of such data. (Already learned in the Pandas course)

・ In this example, do as follows.

gb = df.groupby('type')
gb['food'].count()

Data acquisition from database

Read data from database

-Reading from the database is done by __pd.read_sql () __, but the value passed as an argument is special. -In the first argument, specify the table and column of the database to be read as follows.

''' SELECT Column 1, Column 2 ("," is not required in the last column) FROM table '''

-In the second argument, pass the information of the database to be connected together as follows.

__sqla.create_engine ('connection database + driver name: // connection user name: password @ host name: port number / database name? charset = character code') __

-The flow of data reading is described below. (The information in the database is fictitious)

import sqalchemy as sqla
#Summarize database information
engine = sqla.create_engine('mysql+mysqldb://ngayope:ngayope@mysql-service:3307/database1?charset=utf8')
#From the "products" table of the database "database1", "product"_id "and" product_get "name"
pd.read_sql('''
SELECT
 product_id,
 product_name
FROM products
''',engine)

Join table

-The database tables can be merged with pd.merge (left, right, on =) as in Excel, but in the case of the database, they can be merged more easily. -For joins, use SQL JOIN. -The join method is described below (when joining table B to table A using column 1 as a key).

''' SELECT Table A. Column 1, Table B. Column 1 FROM Table A JOIN table B ON table A. column 1 = table B. column 1 (describe to add) '''

Data refinement

-__ Data narrowing __ is performed using the WHERE clause. When using it, do it in the same line as SELECT and FROM. -Use __ "OR" "AND" __ when you want to narrow down by multiple conditions. "AND" extracts items that satisfy all the conditions.

''' WHERE Table name. Column name = Value you want to extract 1 OR Table name. Column name = Value you want to extract 2 '''

・ Code up to this point (data shows the sales date and list price of the product)![Screenshot 2020-10-29 14.58.27.png](https://qiita-image-store.s3.ap- northeast-1.amazonaws.com/0/698700/f9e63923-93c6-3822-50d1-bb0b09407083.png)

・ Explanation of the above code: __ "SELECT" __ specifies the three data to be output this time. __ "FROM" __ specifies mlprep_sales_products, which is a table of "Which products were sold", and __ "JOIN ON" __ combines this with mlprep_sales and mlprep_products. Finally, in __ "WHERE" __, only those whose column "catalog_price" indicating the list price of the mlprep_products table is 200 or more are extracted.

Data aggregation

-I explained the method using GroupBy of pandas, but SQL also has __GROUP BY clause __ which has the same function, and data can be aggregated using this. -When aggregating data, join the data as needed and then aggregate the data into one table.

''' JOIN Table B ON Table A. Column 1 = Table B. Column 1 GROUP BY columns to aggregate '''

Summary

-Excel data read is _pd.read_excel () ___, data merge is __pd.merge () __, data narrowing is __df.query () __, data aggregation (grouping) is __df.groupby () _ Do with _. -Use __pd.read_sql () __ to read database data. -For the first argument, specify the table and column to be read by __ "SELECT column FROM table" __. -Similarly, use __ "JOIN" __ for data binding, __ "WHERE" __ for narrowing down, and __ "GROUP BY" __ for aggregation. -For the second argument, specify the database information summarized by __sqla.create_engine () __.

This time is over. Thank you for reading until the end.

Preprocessing in machine learning 2 Data acquisition