Last time, in Basic machine learning procedure: ① Classification model, I organized the procedure for building a really basic classification model. However, the data preparation part was only glossed over. So this time I would like to walk through the procedure for creating the following table with BigQuery and importing it into a Python environment.
id | result | product1 | product2 | product3 | product4 | product5 |
---|---|---|---|---|---|---|
001 | 1 | 2500 | 1200 | 1890 | 530 | null |
002 | 0 | 750 | 3300 | null | 1250 | 2000 |
This time I will use Google BigQuery and Google Colaboratory.
The purchase data is stored in a table with the following structure.
id | Store name | Purchase date | Purchase goods | Purchase price | Number of purchases |
---|---|---|---|---|---|
001 | 〇〇 store | 2020-01-10 | product1 | 2500 | 1 |
002 | ☓☓ store | 2020-01-10 | product1 | 750 | 3 |
002 | ☓☓ store | 2020-01-10 | product2 | 1000 | 2 |
The table name on BigQuery is as follows (the data is split into one table per day):
myproject.mydataset.transaction_YYYYMMDD
Unfortunately, the column names cannot actually be given like this on BigQuery, but readable names are used here for the sake of clarity. Please change the names to match your own environment.
I said this would be done on BigQuery, but all of the actual operations are run from Python (Google Colaboratory). A notebook is also handy for keeping a record of what you have done.
To use BigQuery from Colaboratory, you can authenticate with just the following two lines. When the URL is displayed, open it with your own account, copy the one-time code that is issued, and paste it into the cell shown in the notebook.
from google.colab import auth
auth.authenticate_user()
Now, let's proceed with the process of actually aggregating the original purchase data by person and product.
from google.cloud import bigquery
# Here, the target period is January 2020.
from_day = "20200101"
to_day = "20200131"

query = f"""
SELECT
  id, Purchase_goods, SUM(Purchase_price) AS total_amount
FROM `myproject.mydataset.transaction_*`
WHERE _table_suffix BETWEEN '{from_day}' AND '{to_day}'
GROUP BY id, Purchase_goods
"""

# Destination project, dataset, and table name
project = "myproject"
client = bigquery.Client(project=project)
dataset = "mydataset"
ds = client.dataset(dataset)
table = "tmp"

job_config = bigquery.QueryJobConfig()
job_config.destination = ds.table(table)
job_config.write_disposition = "WRITE_TRUNCATE"  # Overwrite the table. Use "WRITE_APPEND" to append instead
job_config.allow_large_results = True  # Allow large result sets. Basically True
job_config.use_legacy_sql = False  # Use legacy SQL? Basically False

job = client.query(query, job_config=job_config)
result = job.result()
When the job completes (the cell finishes without printing anything), the query result has been written to myproject.mydataset.tmp. The job_config settings are a little tedious, but at first it is worth writing them out while being aware of what each one does. (Once you get used to it, you can just copy and paste them.)
If you run everything up to this point, you should get the following table.
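Incidentally, once you know what each setting does, the same configuration can also be passed to the QueryJobConfig constructor in one go. The following is just a more compact sketch of the same settings, using the client, ds, and table variables defined above:
# The same job configuration as above, passed directly to the constructor
job_config = bigquery.QueryJobConfig(
    destination=ds.table(table),
    write_disposition="WRITE_TRUNCATE",  # overwrite; use "WRITE_APPEND" to append
    allow_large_results=True,
    use_legacy_sql=False,
)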
id | Purchase_goods | total_amount |
---|---|---|
001 | product1 | 2500 |
001 | product2 | 1200 |
002 | product1 | 750 |
002 | product2 | 3300 |
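If you want to peek at the destination table directly from the notebook, a quick preview is possible with the same client (this is only a sanity check, not part of the procedure):
# Sanity check: preview a few rows of the intermediate table
preview = client.query(f'SELECT * FROM `{project}.{dataset}.{table}` LIMIT 5').to_dataframe()
print(preview)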
Next, I would like to create a pivot table of id × purchased item, with the purchase amounts as values, to feed into the machine learning model.
With Python and Pandas this is easy to do with pivot_table, but depending on the target period the data can be several GB or even tens of GB, and importing and processing that much data takes extra time, so I do the pivot conversion in BigQuery as well.
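For reference, if the aggregated data were small enough to pull into Pandas, the same wide-format table could be produced with pivot_table. This is only a sketch for comparison, reusing the intermediate table created above:
# For comparison only: pivot the intermediate table in Pandas instead of BigQuery
df_tmp = client.query(f'SELECT * FROM `{project}.{dataset}.{table}`').to_dataframe()
df_pivot = df_tmp.pivot_table(index="id", columns="Purchase_goods",
                              values="total_amount", aggfunc="sum")
In practice the data is too large for this to be comfortable, so below the same pivot is generated as SQL and executed on BigQuery.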
new_columns = []

# Get the list of purchased items that will become the horizontal (wide-format) columns, as a DataFrame
new_column = "Purchase_goods"
query = f'SELECT DISTINCT {new_column} FROM `{project}.{dataset}.{table}` ORDER BY {new_column}'
df_product = client.query(query).to_dataframe()

# Loop over the DataFrame of purchased items to auto-generate the SQL for the wide-format columns
for index, row in df_product.iterrows():
    item_frag = []
    condition_frag = []
    for i, v in zip(row.index, row.values):
        formula = f"{i}='{v}'"
        condition_frag.append(formula)
        item_frag.append(v)
    item = ''.join(item_frag)
    condition = ' AND '.join(condition_frag)
    query_frag = f'SUM(IF(({condition}),Purchase_price,0)) AS {new_column}_{item}'
    new_columns.append(query_frag)

# Join the list new_columns into a single comma-separated string of query parts
query_parts = ',\n'.join(new_columns)
query = f'SELECT id,\n{query_parts}\nFROM `{project}.{dataset}.{table}` GROUP BY id'
# Destination table name for the pivoted result
table2 = "pivoted_table"

# Executing on BigQuery works the same way as above
job_config = bigquery.QueryJobConfig()
job_config.destination = ds.table(table2)
job_config.write_disposition = "WRITE_TRUNCATE"  # Overwrite the table. Use "WRITE_APPEND" to append instead
job_config.allow_large_results = True  # Allow large result sets. Basically True
job_config.use_legacy_sql = False  # Use legacy SQL? Basically False

job = client.query(query, job_config=job_config)
result = job.result()
Generating the SQL for the pivot table is a little complicated because of the double loop, but the SQL I want to end up with is the following.
SELECT
  id,
  SUM(IF((Purchase_goods='product1'),Purchase_price,0)) AS Purchase_goods_product1,
  SUM(IF((Purchase_goods='product2'),Purchase_price,0)) AS Purchase_goods_product2,
  SUM(IF((Purchase_goods='product3'),Purchase_price,0)) AS Purchase_goods_product3,
  ...
FROM `myproject.mydataset.tmp`
GROUP BY id
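If you want to confirm that the auto-generated string really has this shape before running the job, it is enough to print it right after it is built:
# Inspect the generated pivot SQL and how many product columns it contains
print(len(new_columns), "pivot columns")
print(query)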
This is the standard SQL pattern for pivoting a table; the SUM(IF(...)) parts are generated automatically according to the number of products. At this point, the following table has been saved as myproject.mydataset.pivoted_table.
id | product1 | product2 | product3 | product4 | product5 |
---|---|---|---|---|---|
001 | 2500 | 1200 | 1890 | 530 | null |
002 | 750 | 3300 | null | 1250 | 2000 |
Finally, load the data created in BigQuery into the Python environment as a Pandas DataFrame. That said, since we already pulled query results into a DataFrame partway through the pivot conversion, there is nothing new to write at this point.
query = f'SELECT * FROM `{project}.{dataset}.{table2}`'
df = client.query(query).to_dataframe()
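As a quick check that everything came through as expected, you can look at the shape and the first few rows of the loaded DataFrame:
# Quick sanity check of the imported pivot table
print(df.shape)
df.head()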
Although this series is called "basic machine learning procedure", this time I started with BigQuery operations. Since authentication from Colaboratory is easy, I feel the hurdle for working with BigQuery is quite low.
BigQuery is already plenty powerful as it is, and new features are released all the time. I want to keep making good use of them.