Hello, this is HanBei.
The previous article covered collecting data for machine learning, and this one continues from there.
We will keep using Google Colaboratory.
If you find it interesting, please leave a **comment** or an **LGTM**!
I use recipe sites myself, but I kept thinking **"there are so many recipes"** and **"is the recommended dish really that good? (sorry)"**, so I decided to find the recipes I actually want.
I hope this is useful for anyone who wants to collect data from the Web for machine learning.
Scraping can be a **crime** if you do not use it properly and in moderation.
"I want to do scraping, so don't worry about it!" For those who are optimistic or worried, we recommend that you read at least the two articles below.
・ miyabisun: "Don't ask how to scrape on Q&A sites"
・ nezuq: "List of precautions for web scraping"
This article explains how to scrape, but **we take no responsibility** for how you use it.
Think for yourself and use it ethically.
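As one concrete courtesy, you can at least check a site's robots.txt before scraping and always pause between requests. This is only a minimal sketch using Python's standard library (it is not taken from the two articles above, and the path checked is just an example):

```python
import urllib.robotparser

# Minimal politeness check (example path): ask robots.txt whether crawling is allowed
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://recipe.rakuten.co.jp/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://recipe.rakuten.co.jp/search/'))  # True / False
```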
There are well-known recipe sites such as Cookpad, Nadia, and Shirogohan.com, but this time we will use **Rakuten Recipe**.
The reasons are:
・ It has quantitative data such as "I want to repeat", "It was easy", and "I saved".
・ It has a large number of recipes.
If you haven't created a Google account to use Google Colaboratory, create one.
How to create a new notebook ...
From here, I will walk through the implementation.
First, import the libraries.
```python
from bs4 import BeautifulSoup
from google.colab import drive
from google.colab import files
import urllib
import urllib.parse
import urllib.request as req
import csv
import random
import pandas as pd
import numpy as np
import time
import datetime
```
Decide on the name of the dish you want to look up!
```python
# The name of the dish you want to look up
food_name = 'curry'
```
Create a function to get the URL of the recipe.
```python
# Store the URL of each recipe
recipe_url_lists = []

def GetRecipeURL(url):
    res = req.urlopen(url)
    soup = BeautifulSoup(res, 'html.parser')
    # Select the part of the page that contains the recipe list
    recipe_text = str(soup.find_all('li', class_='clearfix'))
    # Split the acquired text line by line and store it in a list
    recipe_text_list = recipe_text.split('\n')
    # Read the list line by line and extract only the lines that match the dish name
    for text in recipe_text_list:
        # Get the URL of each recipe
        if 'a href="/recipe/' in text:
            # Slice out the recipe ID part
            recipe_url_id = text[16:27]
            # Join the URL
            recipe_url_list = 'https://recipe.rakuten.co.jp/recipe/' + recipe_url_id + '/?l-id=recipe_list_detail_recipe'
            # Store the URL
            recipe_url_lists.append(recipe_url_list)
        # Get the title of each recipe
        if 'h3' in text:
            print(text + ", " + recipe_url_list)
```
Get the recipes in order of popularity.
```python
# Number of pages you want to look up
page_count = 2

# Encode the dish name so it can be put in the URL
name_quote = urllib.parse.quote(food_name)

# Build the URL (first page only)
# In order of popularity
base_url = 'https://recipe.rakuten.co.jp/search/' + name_quote
# In order of newest arrival
# base_url = 'https://recipe.rakuten.co.jp/search/' + name_quote + '/?s=0&v=0&t=2'

for num in range(page_count):
    # To start from a specific page
    # num = num + 50
    if num == 1:
        # Build the URL (first page only)
        GetRecipeURL(base_url)
    if num > 1:
        # Build the URL (pages 2 and onwards)
        # In order of popularity
        base_url_other = 'https://recipe.rakuten.co.jp/search/' + name_quote + '/' + str(num) + '/?s=4&v=0&t=2'
        # In order of newest arrival
        # base_url_other = 'https://recipe.rakuten.co.jp/search/' + name_quote + '/' + str(num) + '/?s=0&v=0&t=2'
        GetRecipeURL(base_url_other)
    # Follow the 1-second rule for scraping
    time.sleep(1)
```
When you run this, the title and URL of each recipe are displayed.
Let's check the number of recipes obtained here!
```python
# Number of recipes acquired
len(recipe_url_lists)
```
When executed, 17 items are displayed.
Next, get the necessary data from each recipe.
```python
data_count = []
recipe_data_set = []

def SearchRecipeInfo(url, tag, id_name):
    res = req.urlopen(url)
    soup = BeautifulSoup(res, 'html.parser')
    for all_text in soup.find_all(tag, id=id_name):
        # Recipe ID
        for text in all_text.find_all('p', class_='rcpId'):
            recipe_id = text.get_text()[7:17]
        # Release date
        for text in all_text.find_all('p', class_='openDate'):
            recipe_date = text.get_text()[4:14]
        # The three stamp types: "I want to repeat", "It was easy", "I was able to save"
        for text in all_text.find_all('div', class_='stampHead'):
            for tag in text.find_all('span', class_='stampCount'):
                data_count.append(tag.get_text())
        # Number of "I made this" reports
        for text in all_text.find_all('div', class_='recipeRepoBox'):
            for tag in text.find_all('h2'):
                # When the number of reports is 0
                if tag.find('span') == None:
                    report = str(0)
                else:
                    for el in tag.find('span'):
                        # Strip whitespace and the counter "件" from e.g. "12件"
                        report = el.replace('\n ', '').replace('件', '')
        print("ID: " + recipe_id + ", DATE: " + recipe_date + ", Number made: " + report +
              ", I want to repeat: " + data_count[0] +
              ", It was easy: " + data_count[1] +
              ", I was able to save: " + data_count[2] +
              ", url: " + url)
        # Store the row that will be written to the csv file
        recipe_data_set.append([recipe_id, recipe_date, data_count[0], data_count[1], data_count[2], report, url])
        # Empty the list containing the stamp counts
        data_count.clear()
    # Follow the 1-second rule for scraping
    time.sleep(1)
```
Here, check the acquired data.
```python
for num in range(len(recipe_url_lists)):
    SearchRecipeInfo(recipe_url_lists[num], 'div', 'detailContents')
```
When you execute it, you can see that it has been acquired properly.
Next, output the data to a CSV file on Google Drive.
```python
# Mount Google Drive
drive.mount('/content/drive')
```
Choose any folder in Google Drive and specify the file name; replace "〇〇〇" with whatever name you like.
```python
# Create a folder on Google Drive and specify it as the save destination
save_dir = "./drive/My Drive/Colab Notebooks/〇〇〇/"

# Choose a file name
data_name = '〇〇〇.csv'

# Path of the csv file inside the folder
data_dir = save_dir + data_name

# Write the items to the csv file
with open(data_dir, 'w', newline='') as file:
    writer = csv.writer(file, lineterminator='\n')
    writer.writerow(['ID', 'Release Date', 'Repeat', 'Easy', 'Economy', 'Report', 'URL'])
    for num in range(len(recipe_url_lists)):
        writer.writerow(recipe_data_set[num])

# Read back the created file
with open(data_dir, 'r') as file:
    sheet_info = file.read()
```
When executed, 〇〇〇.csv will be output to the specified directory.
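As an aside, since the data ends up in pandas right after this anyway, the same file could also be written with `DataFrame.to_csv`. This is just an alternative sketch, not what the article uses:

```python
# Alternative sketch: write the same CSV with pandas instead of the csv module
columns = ['ID', 'Release Date', 'Repeat', 'Easy', 'Economy', 'Report', 'URL']
pd.DataFrame(recipe_data_set, columns=columns).to_csv(data_dir, index=False)
```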
Check the output CSV file with pandas.
```python
# Load the csv
rakuten_recipes = pd.read_csv(data_dir, encoding="UTF-8")
# Prepare to add columns
df = pd.DataFrame(rakuten_recipes)
df
```
The image of the output is omitted.
Next, calculate the number of days elapsed from the publication date of the recipe to today.
```python
# Extract the Release Date column from rakuten_recipes.csv
date = np.array(rakuten_recipes['Release Date'])

# Get the current date
today = datetime.date.today()

# Align the types
df['Release Date'] = pd.to_datetime(df['Release Date'], format='%Y-%m-%d')
today = pd.to_datetime(today, format='%Y-%m-%d')

df['Elapsed Days'] = today - df['Release Date']
# Keep only the number of elapsed days as an integer
df['Elapsed Days'] = df['Elapsed Days'].dt.days

# Check only the top 5 rows
df.head()
```
Then, the number of elapsed days will appear next to the URL column.
Next, use the number of elapsed days as a **weight**, and weight the three stamp types: Repeat, Easy, and Economy. Add the weighted values as new columns to the existing DataFrame.
```python
# Correction factor so that the values do not become too small
weighting = 1000000

# Extract the values of the 3 stamp types and the reports
repeat_stamp = np.array(rakuten_recipes['Repeat'])
easy_stamp = np.array(rakuten_recipes['Easy'])
economy_stamp = np.array(rakuten_recipes['Economy'])
report_stamp = np.array(rakuten_recipes['Report'])

# Totals of each stamp and of the reports
repeat_stamp_sum = sum(repeat_stamp)
easy_stamp_sum = sum(easy_stamp)
economy_stamp_sum = sum(economy_stamp)
report_stamp_sum = sum(report_stamp)

# Add columns of weighted values
'''
Repeat weight = (number of Repeat stamps ÷ total Repeat stamps) × (correction factor ÷ days elapsed since the release date)
'''
df['Repeat WT'] = (df['Repeat'] / repeat_stamp_sum) * (weighting / df['Elapsed Days'])
df['Easy WT'] = (df['Easy'] / easy_stamp_sum) * (weighting / df['Elapsed Days'])
df['Economy WT'] = (df['Economy'] / economy_stamp_sum) * (weighting / df['Elapsed Days'])

# Importance of the reports (range 0 to 1)
proportions_rate = 0.5

# Fold the report counts into the weighted columns
'''
Repeat weight = (Repeat weight × (1 - importance)) × ((number of reports ÷ total reports) × importance)
'''
df['Repeat WT'] = (df['Repeat WT'] * (1 - proportions_rate)) * ((df['Report'] / report_stamp_sum) * proportions_rate)
df['Easy WT'] = (df['Easy WT'] * (1 - proportions_rate)) * ((df['Report'] / report_stamp_sum) * proportions_rate)
df['Economy WT'] = (df['Economy WT'] * (1 - proportions_rate)) * ((df['Report'] / report_stamp_sum) * proportions_rate)
```
About the weighting: suppose one recipe was posted a month ago and another a year ago, and both have the same 100 stamps. The one from a month ago is the stronger recommendation, so recipes with more elapsed days get a lower score.
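For a rough feel of the effect, here is a tiny example with made-up numbers plugged into the same formula (100 Repeat stamps out of a hypothetical total of 1000):

```python
# Hypothetical numbers: the same 100 stamps, but different ages
weighting = 1000000
repeat_stamp_sum = 1000

one_month = (100 / repeat_stamp_sum) * (weighting / 30)    # ~30 days old
one_year  = (100 / repeat_stamp_sum) * (weighting / 365)   # ~365 days old

print(one_month)  # about 3333.3
print(one_year)   # about 274.0 -> the older recipe scores far lower
```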
Rescale the weighted values so that they range from 0 to 1 (min-max normalization). Here is the page I used as a reference.
QUANON: "Convert a number in one range to a number in another range"
```python
df['Repeat WT'] = (df['Repeat WT'] - np.min(df['Repeat WT'])) / (np.max(df['Repeat WT']) - np.min(df['Repeat WT']))
df['Easy WT'] = (df['Easy WT'] - np.min(df['Easy WT'])) / (np.max(df['Easy WT']) - np.min(df['Easy WT']))
df['Economy WT'] = (df['Economy WT'] - np.min(df['Economy WT'])) / (np.max(df['Economy WT']) - np.min(df['Economy WT']))

df.head()
```
This is the execution result. df.head() displays only the top 5 rows.
Next, take the user's scores on a 5-point scale and search the recipes with them.
```python
# Used to specify the range (1: 0-0.2, 2: 0.2-0.4, 3: 0.4-0.6, 4: 0.6-0.8, 5: 0.8-1)
condition_num = 0.2

def PlugInScore(repeat, easy, economy):
    # Clamp the arguments to the allowed range
    if 1 >= repeat:
        repeat = 1
    if 5 <= repeat:
        repeat = 5
    if 1 >= easy:
        easy = 1
    if 5 <= easy:
        easy = 5
    if 1 >= economy:
        economy = 1
    if 5 <= economy:
        economy = 5

    # Narrow down the recipes from the 3 scores
    df_result = df[((repeat*condition_num) - condition_num <= df['Repeat WT']) & (repeat*condition_num >= df['Repeat WT']) &
                   ((easy*condition_num) - condition_num <= df['Easy WT']) & (easy*condition_num >= df['Easy WT']) &
                   ((economy*condition_num) - condition_num <= df['Economy WT']) & (economy*condition_num >= df['Economy WT'])]
    # print(df_result)
    CsvOutput(df_result)
```
Output the search result to a CSV file. Replace 〇〇〇 with any name you like!
```python
# Choose a file name
data_name = '〇〇〇_result.csv'
# Path of the csv file inside the folder
data_dir_result = save_dir + data_name

# Output to csv
def CsvOutput(df_result):
    # Output the narrowed-down result to a csv file
    with open(data_dir_result, 'w', newline='') as file:
        writer = csv.writer(file, lineterminator='\n')
        # Header row (the column names)
        writer.writerow(df_result)
        # Each row of values
        for num in range(len(df_result)):
            writer.writerow(df_result.values[num])
    # Read back the created file
    with open(data_dir_result, 'r') as file:
        sheet_info = file.read()
    AdviceRecipe()
```
Declare a function to display the result.
```python
def AdviceRecipe():
    # Load the csv
    rakuten_recipes_result = pd.read_csv(data_dir_result, encoding="UTF-8")
    # Prepare as a DataFrame
    df_recipes_res = pd.DataFrame(rakuten_recipes_result)
    print(df_recipes_res)

    print('Recommended "' + food_name + '" for you')
    print("Entry No.1: " + df_recipes_res['URL'][random.randint(0, len(df_recipes_res) - 1)])
    print("Entry No.2: " + df_recipes_res['URL'][random.randint(0, len(df_recipes_res) - 1)])
    print("Entry No.3: " + df_recipes_res['URL'][random.randint(0, len(df_recipes_res) - 1)])
```
Finally, give a score to the recipe you want to make and display the recommendations.
```python
'''
Pass values to PlugInScore(repeat, easy, economy)

repeat : Do you want to make it again?
easy   : Is it easy to make?
economy: Can you make it cheaply?

Rate each subjectively on a 5-point scale from 1 to 5 and pass integers.
1 is negative, 5 is positive.
'''
PlugInScore(1, 1, 1)
```
The three scores are **1**: Do you want to make it again?, **1**: Is it easy to make?, **1**: Can you save money making it?
The execution result is ...
・ Reconsider the evaluation method: the three weighted values are pulled up by the highest-scoring recipe, which produces extreme results, so scores end up biased toward either 1 or 5.
・ Few recipes have stamps: roughly **10**% of recipes have stamps or reports, and roughly **90**% are all zeros, so scoring recipes by points may not make much sense. There are surely great recipes among the all-zero ones.
I used this system to search for "pork kimchi" and cooked one of the results. It was a recommended recipe, and it was delicious ^^
It was fun to discover recipes that had been buried.
Thank you to everyone who has read this far. I would be grateful if you could give us your comments and advice ^^