Last time, in Basic machine learning procedure: ① Classification model, I organized the procedure for building a really basic classification model. However, the data preparation part was only glossed over. So this time I would like to walk through the procedure for creating the following table with BigQuery and importing it into a Python environment.
id | result | product1 | product2 | product3 | product4 | product5 |
---|---|---|---|---|---|---|
001 | 1 | 2500 | 1200 | 1890 | 530 | null |
002 | 0 | 750 | 3300 | null | 1250 | 2000 |
This time I will use Google BigQuery and Google Colaboratory.
The purchase data is stored in a table with the following structure.
id | Store name | Purchase date | Purchase goods | Purchase price | Number of purchases |
---|---|---|---|---|---|
001 | 〇〇 store | 2020-01-10 | product1 | 2500 | 1 |
002 | ☓☓ store | 2020-01-10 | product1 | 750 | 3 |
002 | ☓☓ store | 2020-01-10 | product2 | 1000 | 2 |
The table name on BigQuery is as follows (the data is split into one table per day):
myproject.mydataset.transaction_YYYYMMDD
Unfortunately, the column names cannot actually be given like this on BigQuery, but readable names are used here for the sake of clarity. Please change the names to match your own environment.
I said this would be done on BigQuery, but all of the actual operations are run from Python (Google Colaboratory). A notebook is also handy for keeping a record of what you have done.
To use BigQuery from Colaboratory, you can authenticate with just the following two lines. When the URL is displayed, open it with your own account, copy the one-time code that is issued, and paste it into the cell shown in the notebook.
from google.colab import auth
auth.authenticate_user()
Now, let's proceed with the process of actually aggregating the original purchase data by person and product.
from google.cloud import bigquery
# Here, the target period is January 2020.
from_day = "20200101"
to_day = "20200131"

query = f"""
SELECT
  id, Purchase_goods, SUM(Purchase_price) AS total_amount
FROM `myproject.mydataset.transaction_*`
WHERE _table_suffix BETWEEN '{from_day}' AND '{to_day}'
GROUP BY id, Purchase_goods
"""

# Destination project, dataset, and table name
project = "myproject"
client = bigquery.Client(project=project)
dataset = "mydataset"
ds = client.dataset(dataset)
table = "tmp"

job_config = bigquery.QueryJobConfig()
job_config.destination = ds.table(table)
job_config.write_disposition = "WRITE_TRUNCATE"  # Overwrite the table. Use "WRITE_APPEND" to append instead
job_config.allow_large_results = True  # Allow large result sets. Basically True
job_config.use_legacy_sql = False  # Use legacy SQL? Basically False

job = client.query(query, job_config=job_config)
result = job.result()
When the job completes (the cell finishes without printing anything), the query result has been written to myproject.mydataset.tmp. The job_config settings are a little tedious, but at first it is worth writing them out while being aware of what each one does. (Once you get used to it, you can just copy and paste them.)
If you run everything up to this point, you should get the following table.
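Incidentally, once you know what each setting does, the same configuration can also be passed to the QueryJobConfig constructor in one go. The following is just a more compact sketch of the same settings, using the client, ds, and table variables defined above:
# The same job configuration as above, passed directly to the constructor
job_config = bigquery.QueryJobConfig(
    destination=ds.table(table),
    write_disposition="WRITE_TRUNCATE",  # overwrite; use "WRITE_APPEND" to append
    allow_large_results=True,
    use_legacy_sql=False,
)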
id | Purchase_goods | total_amount |
---|---|---|
001 | product1 | 2500 |
001 | product2 | 1200 |
002 | product1 | 750 |
002 | product2 | 3300 |
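If you want to peek at the destination table directly from the notebook, a quick preview is possible with the same client (this is only a sanity check, not part of the procedure):
# Sanity check: preview a few rows of the intermediate table
preview = client.query(f'SELECT * FROM `{project}.{dataset}.{table}` LIMIT 5').to_dataframe()
print(preview)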
Next, I would like to create a pivot table of id × purchased item, with the purchase amounts as values, to feed into the machine learning model.
With Python and Pandas this is easy to do with pivot_table, but depending on the target period the data can be several GB or even tens of GB, and importing and processing that much data takes extra time, so I do the pivot conversion in BigQuery as well.
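For reference, if the aggregated data were small enough to pull into Pandas, the same wide-format table could be produced with pivot_table. This is only a sketch for comparison, reusing the intermediate table created above:
# For comparison only: pivot the intermediate table in Pandas instead of BigQuery
df_tmp = client.query(f'SELECT * FROM `{project}.{dataset}.{table}`').to_dataframe()
df_pivot = df_tmp.pivot_table(index="id", columns="Purchase_goods",
                              values="total_amount", aggfunc="sum")
In practice the data is too large for this to be comfortable, so below the same pivot is generated as SQL and executed on BigQuery.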
new_columns = []

# Get the list of purchased items that will become the horizontal (wide-format) columns, as a DataFrame
new_column = "Purchase_goods"
query = f'SELECT DISTINCT {new_column} FROM `{project}.{dataset}.{table}` ORDER BY {new_column}'
df_product = client.query(query).to_dataframe()

# Loop over the DataFrame of purchased items to auto-generate the SQL for the wide-format columns
for index, row in df_product.iterrows():
    item_frag = []
    condition_frag = []
    for i, v in zip(row.index, row.values):
        formula = f"{i}='{v}'"
        condition_frag.append(formula)
        item_frag.append(v)
    item = ''.join(item_frag)
    condition = ' AND '.join(condition_frag)
    query_frag = f'SUM(IF(({condition}),Purchase_price,0)) AS {new_column}_{item}'
    new_columns.append(query_frag)

# Join the list new_columns into a single comma-separated string of query parts
query_parts = ',\n'.join(new_columns)
query = f'SELECT id,\n{query_parts}\nFROM `{project}.{dataset}.{table}` GROUP BY id'
# Destination table name for the pivoted result
table2 = "pivoted_table"

# Executing on BigQuery works the same way as above
job_config = bigquery.QueryJobConfig()
job_config.destination = ds.table(table2)
job_config.write_disposition = "WRITE_TRUNCATE"  # Overwrite the table. Use "WRITE_APPEND" to append instead
job_config.allow_large_results = True  # Allow large result sets. Basically True
job_config.use_legacy_sql = False  # Use legacy SQL? Basically False

job = client.query(query, job_config=job_config)
result = job.result()
Generating the SQL for the pivot table is a little complicated because of the double loop, but the SQL I want to end up with is the following.
SELECT
  id,
  SUM(IF((Purchase_goods='product1'),Purchase_price,0)) AS Purchase_goods_product1,
  SUM(IF((Purchase_goods='product2'),Purchase_price,0)) AS Purchase_goods_product2,
  SUM(IF((Purchase_goods='product3'),Purchase_price,0)) AS Purchase_goods_product3,
  ...
FROM `myproject.mydataset.tmp`
GROUP BY id
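If you want to confirm that the auto-generated string really has this shape before running the job, it is enough to print it right after it is built:
# Inspect the generated pivot SQL and how many product columns it contains
print(len(new_columns), "pivot columns")
print(query)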
This is the standard SQL pattern for pivoting a table; the SUM(IF(...)) parts are generated automatically according to the number of products. At this point, the following table has been saved as myproject.mydataset.pivoted_table.
id | product1 | product2 | product3 | product4 | product5 |
---|---|---|---|---|---|
001 | 2500 | 1200 | 1890 | 530 | null |
002 | 750 | 3300 | null | 1250 | 2000 |
Finally, load the data created in BigQuery into the Python environment as a Pandas DataFrame. That said, since we already pulled query results into a DataFrame partway through the pivot conversion, there is nothing new to write at this point.
query = f'SELECT * FROM `{project}.{dataset}.{table2}`'
df = client.query(query).to_dataframe()
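As a quick check that everything came through as expected, you can look at the shape and the first few rows of the loaded DataFrame:
# Quick sanity check of the imported pivot table
print(df.shape)
df.head()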
Although this series is called "basic machine learning procedure", this time I started with BigQuery operations. Since authentication from Colaboratory is easy, I feel the hurdle for working with BigQuery is quite low.
BigQuery is already plenty powerful as it is, and new features are released all the time. I want to keep making good use of them.