Last time, I used the Rakuten Product Search API to fetch 30 items (one page) of product data and save them to a CSV file.
This time, aiming to gather data that is more useful for analysis and decision making, I first extend the previous script to fetch more records, and then, as a preliminary step toward aggregation and analysis, process the product data so that it can be classified by weight (5 kg, 10 kg, ...).
--Fetch 300 items (30 items x 10 pages) of product information for the search keyword "May Queen" (with NG words specified)
--Create a dict containing only the necessary product fields
--Store everything in a pandas DataFrame and organize the columns
The flow is the same as last time, but this time the number of items to fetch and the NG words are specified. Adjust these according to your purpose and the amount and kind of information you need.
In my case, I specified NG words for listings that do not match what I am looking for and that are likely to skew product prices: "organic / organically grown / reduced-pesticide / pesticide-free", "set / assortment", and "hometown tax donation" are excluded. There is no deep reason for choosing 300 items; when I searched with this keyword and these NG words, the hit count was around 320, so 300 was a convenient round number.
--Extract only the necessary columns from the collected data
--Extract the weight (5 kg, 10 kg, ...) from the product name and create a new "Quantity" column
Strictly speaking, it would be more accurate to also split the quantities by vegetable variety, but I decided not to this time: subdividing too finely would leave too few records per group.
The request is almost the same as last time; only the `page`, `NGKeyword`, and `postageFlag` input parameters sent to the API are changed.
The Rakuten Product Search API returns at most 30 records per request (the value of the `hits` input parameter). Specifying `page` = 2 returns records from the 31st onward, so the plan is to step through values of `page` with a `for` loop.
In addition, the `postageFlag` input parameter specifies whether the item price includes shipping (1 means shipping is included or free).
import requests
import numpy as np
import pandas as pd

REQUEST_URL = "https://app.rakuten.co.jp/services/api/IchibaItem/Search/20170706"
APP_ID = "<Enter your Rakuten application ID>"

# Input parameters
serch_keyword = 'May Queen'
ng_keyword = 'hometown-tax-donation organic organically-grown reduced-pesticide pesticide-free set assortment'  # NG words are space-separated
page = 1
serch_params = {
    "format": "json",
    "keyword": serch_keyword,
    "NGKeyword": ng_keyword,
    "applicationId": [APP_ID],
    "availability": 0,
    "hits": 30,          # maximum number of items per request
    "page": page,
    "sort": "standard",
    "postageFlag": 1     # 1 = price includes shipping (or shipping is free)
}
A `for` loop is used to collect product information spanning multiple pages. As last time, only the necessary fields are extracted into a dict named `tmp_item`, which is then appended to a list named `item_list`.
# Collect product information into a list
item_list = []  # dict-type product info: 30 items x 10 pages
max_page = 10
item_key = ['itemName', 'itemPrice', 'itemCaption', 'shopName', 'shopUrl', 'itemUrl']

for page in range(1, max_page + 1):
    serch_params['page'] = page
    # Send a request to the API and receive the result as JSON
    response = requests.get(REQUEST_URL, params=serch_params)
    result = response.json()
    # Build a dict holding only the necessary fields
    for i in range(len(result['Items'])):
        tmp_item = {}
        item = result['Items'][i]['Item']
        for key, value in item.items():
            if key in item_key:
                tmp_item[key] = value
        item_list.append(tmp_item.copy())
Incidentally, even when I entered a keyword that had fewer than 10 pages of results, the loop completed without errors: it seems `page` can be set larger than the actual number of result pages (according to the API reference, the upper limit of `page` is 100). However, many requests take a long time, and in my environment an error occasionally occurs, so it is better to keep the page count to the minimum necessary.
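Since each page is a separate HTTP request, a short pause between requests and a simple retry can mitigate the occasional errors mentioned above. The following is only a sketch of that idea; the `fetch_with_retry` helper and its retry/wait values are my own additions, not part of the original script.

```python
import time

def fetch_with_retry(fetch, page, retries=3, wait=1.0):
    """Call fetch(page), retrying up to `retries` times on failure
    and pausing `wait` seconds between attempts."""
    for attempt in range(retries):
        try:
            return fetch(page)
        except Exception:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(wait)

# Usage sketch, assuming REQUEST_URL and serch_params from above:
# def fetch(page):
#     serch_params['page'] = page
#     response = requests.get(REQUEST_URL, params=serch_params)
#     response.raise_for_status()
#     return response.json()
# result = fetch_with_retry(fetch, page)
```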
As before, a pandas DataFrame is created from the list of product-information dicts. I made one small change so that the index starts from 1 instead of 0.
# Create the DataFrame
df = pd.DataFrame(item_list)
df = df.reindex(columns=['itemName', 'itemPrice', 'itemCaption', 'itemUrl', 'shopName', 'shopUrl'])
df.columns = ['Product name', 'Product price', 'Product description', 'Product URL', 'Store name', 'Store URL']
df.index = df.index + 1  # re-index from 1
Check the number of records with `df.count()` and the first five rows with `df.head()`. If it contains 300 records as expected, all is well.
Since running this script every time I need the data would be tedious, I write the DataFrame out as a CSV so it can be reused later.
df.to_csv('20200914_rakuten_mayqueen.csv')
Now that we have more records, let's shape the data into a form suitable for analysis. You could continue with the DataFrame as-is, but here I reload it from the CSV file first.
df = pd.read_csv('20200914_rakuten_mayqueen.csv')
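One caveat worth noting (my addition, not from the original post): `to_csv()` above also writes the index as the first column, so reading the file back without options adds an extra "Unnamed: 0" column. Passing `index_col=0` to `read_csv()` restores the original shape, as this small self-contained sketch shows:

```python
import io
import pandas as pd

# Toy stand-in for the real DataFrame (index starts at 1, as above)
df = pd.DataFrame({'Product name': ['May Queen 5kg'], 'Product price': [2000]}, index=[1])

buf = io.StringIO()          # in-memory file instead of a real CSV
df.to_csv(buf)               # the index is written as the first column
buf.seek(0)
restored = pd.read_csv(buf, index_col=0)  # read that column back as the index
print(restored.columns.tolist())  # ['Product name', 'Product price']
```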
Before processing this DataFrame, a quick look at its contents in a spreadsheet revealed that it contains extra records other than vegetables, which do not fit the purpose of this price survey. Since weight data is needed this time, the policy is to keep only records whose product name contains a weight.
In pandas, `str.contains()` checks each row for a partial match against the given string and returns the result as boolean values.
#Leave only product data with "kg" in the product name
kg_flg = df['Product name'].str.contains('kg')
df = df[kg_flg]
`kg_flg` is a Series of booleans: True for rows whose product name contains "kg" and False for the rest. The True rows are the data we want to keep, so indexing with `df[kg_flg]` extracts a DataFrame containing only those rows.
Checking the count with `df.count()` shows it has dropped to 116 records. If you want to secure more data, this filtering step probably needs a bit more scrutiny.
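As one direction for that scrutiny (my own sketch, not from the original), `str.contains()` also accepts a regular expression, so the filter could be loosened to tolerate uppercase units or a space between the number and "kg":

```python
import pandas as pd

df = pd.DataFrame({'Product name': ['May Queen 5kg',
                                    'May Queen 10 KG',
                                    'May Queen seed potatoes']})

# case=False also matches "KG"; \s* allows an optional space before the unit
kg_flg = df['Product name'].str.contains(r'[0-9]+\s*kg', case=False, regex=True)
print(df[kg_flg]['Product name'].tolist())  # keeps the first two rows
```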
This leaves only rows with "kg" in the product name; next I want to pull this weight out into its own column. The product name should contain the weight in the form "number + kg", so we extract that number and create a new column named "Quantity".
A regular expression does the job here (without going into detail, `([0-9]+)kg` matches "number + kg" and captures the number).
Pass this regular expression as the argument to pandas' `str.extract()`. Given a regex with a capture group, this method extracts the first match in each row and returns it as a new column, which is exactly what we need here.
# Cut the weight out of the product name into a separate column
df['Quantity'] = df['Product name'].str.extract('([0-9]+)kg')
df = df.reindex(columns=['Product name', 'Quantity', 'Product price', 'Product description', 'Product URL', 'Store name', 'Store URL'])
df.to_csv('20200914_rakuten_mayqueen_2.csv')
The columns are reordered for readability, and the result is written out as a CSV at the end. The DataFrame now looks like this.
In the screenshot, after converting the quantity to a numeric type for later analysis, I also added a per-kg unit price, computed as the product price divided by the quantity. The numbers vary quite a bit, though, so I will dig into that next time.
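That conversion and the unit-price column can be sketched as follows (a toy example; the "kg unit price" column name is my own label, and `str.extract()` returns strings, hence the `pd.to_numeric()` call):

```python
import pandas as pd

df = pd.DataFrame({'Product name': ['May Queen 5kg', 'May Queen 10kg'],
                   'Quantity': ['5', '10'],   # str.extract() returns strings
                   'Product price': [2000, 3500]})

df['Quantity'] = pd.to_numeric(df['Quantity'])              # convert to numbers
df['kg unit price'] = df['Product price'] / df['Quantity']  # price per kg
print(df['kg unit price'].tolist())  # [400.0, 350.0]
```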
Next time, I will aggregate the data to compute statistics such as the average product price, and try visualization and a linear fit of the per-kg unit price: simple analyses that could feed into decision making.