Hello, this is HanBei.
The previous article covered collecting data for machine learning, and this one continues from there.
We will keep using Google Colaboratory.
If you find it interesting, please leave a **comment** or an **LGTM**!
I use recipe sites myself, but I kept thinking **"there are so many recipes"** and **"is the recommended dish really that good? (sorry)"**, so I decided to find the recipes I actually want.
I hope this is useful for anyone who wants to collect data from the Web for machine learning.
Scraping can be a **crime** if you do not use it properly and in moderation.
"I want to do scraping, so don't worry about it!" For those who are optimistic or worried, we recommend that you read at least the two articles below.
・ miyabisun: "Don't ask how to scrape on Q&A sites"
・ nezuq: "List of precautions for web scraping"
This article explains how to scrape, but **we take no responsibility** for how you use it.
Think for yourself and use it ethically.
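As one concrete courtesy, you can at least check a site's robots.txt before scraping and always pause between requests. This is only a minimal sketch using Python's standard library (it is not taken from the two articles above, and the path checked is just an example):

```python
import urllib.robotparser

# Minimal politeness check (example path): ask robots.txt whether crawling is allowed
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://recipe.rakuten.co.jp/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://recipe.rakuten.co.jp/search/'))  # True / False
```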
There are well-known recipe sites such as Cookpad, Nadia, and Shirogohan.com, but this time we will use **Rakuten Recipe**.
The reasons are:
・ It has quantitative data such as "I want to repeat", "It was easy", and "I saved".
・ It has a large number of recipes.
If you haven't created a Google account to use Google Colaboratory, create one.
How to create a new notebook ...
From here, I will walk through the implementation.
First, import the libraries.
```python
from bs4 import BeautifulSoup
from google.colab import drive
from google.colab import files
import urllib
import urllib.parse
import urllib.request as req
import csv
import random
import pandas as pd
import numpy as np
import time
import datetime
```
Decide on the name of the dish you want to look up!
```python
# The name of the dish you want to look up
food_name = 'curry'
```
Create a function to get the URL of the recipe.
```python
# Store the URL of each recipe
recipe_url_lists = []

def GetRecipeURL(url):
    res = req.urlopen(url)
    soup = BeautifulSoup(res, 'html.parser')
    # Select the part of the page that contains the recipe list
    recipe_text = str(soup.find_all('li', class_='clearfix'))
    # Split the acquired text line by line and store it in a list
    recipe_text_list = recipe_text.split('\n')
    # Read the list line by line and extract only the lines that match the dish name
    for text in recipe_text_list:
        # Get the URL of each recipe
        if 'a href="/recipe/' in text:
            # Slice out the recipe ID part
            recipe_url_id = text[16:27]
            # Join the URL
            recipe_url_list = 'https://recipe.rakuten.co.jp/recipe/' + recipe_url_id + '/?l-id=recipe_list_detail_recipe'
            # Store the URL
            recipe_url_lists.append(recipe_url_list)
        # Get the title of each recipe
        if 'h3' in text:
            print(text + ", " + recipe_url_list)
```
Get the recipes in order of popularity.
```python
# Number of pages you want to look up
page_count = 2

# Encode the dish name so it can be put in the URL
name_quote = urllib.parse.quote(food_name)

# Build the URL (first page only)
# In order of popularity
base_url = 'https://recipe.rakuten.co.jp/search/' + name_quote
# In order of newest arrival
# base_url = 'https://recipe.rakuten.co.jp/search/' + name_quote + '/?s=0&v=0&t=2'

for num in range(page_count):
    # To start from a specific page
    # num = num + 50
    if num == 1:
        # Build the URL (first page only)
        GetRecipeURL(base_url)
    if num > 1:
        # Build the URL (pages 2 and onwards)
        # In order of popularity
        base_url_other = 'https://recipe.rakuten.co.jp/search/' + name_quote + '/' + str(num) + '/?s=4&v=0&t=2'
        # In order of newest arrival
        # base_url_other = 'https://recipe.rakuten.co.jp/search/' + name_quote + '/' + str(num) + '/?s=0&v=0&t=2'
        GetRecipeURL(base_url_other)
    # Follow the 1-second rule for scraping
    time.sleep(1)
```
When you run this, the title and URL of each recipe are displayed.
Let's check the number of recipes obtained here!
```python
# Number of recipes acquired
len(recipe_url_lists)
```
When executed, 17 items are displayed.
Next, get the necessary data from each recipe.
```python
data_count = []
recipe_data_set = []

def SearchRecipeInfo(url, tag, id_name):
    res = req.urlopen(url)
    soup = BeautifulSoup(res, 'html.parser')
    for all_text in soup.find_all(tag, id=id_name):
        # Recipe ID
        for text in all_text.find_all('p', class_='rcpId'):
            recipe_id = text.get_text()[7:17]
        # Release date
        for text in all_text.find_all('p', class_='openDate'):
            recipe_date = text.get_text()[4:14]
        # The three stamp types: "I want to repeat", "It was easy", "I was able to save"
        for text in all_text.find_all('div', class_='stampHead'):
            for tag in text.find_all('span', class_='stampCount'):
                data_count.append(tag.get_text())
        # Number of "I made this" reports
        for text in all_text.find_all('div', class_='recipeRepoBox'):
            for tag in text.find_all('h2'):
                # When the number of reports is 0
                if tag.find('span') == None:
                    report = str(0)
                else:
                    for el in tag.find('span'):
                        # Strip whitespace and the counter "件" from e.g. "12件"
                        report = el.replace('\n ', '').replace('件', '')
        print("ID: " + recipe_id + ", DATE: " + recipe_date + ", Number made: " + report +
              ", I want to repeat: " + data_count[0] +
              ", It was easy: " + data_count[1] +
              ", I was able to save: " + data_count[2] +
              ", url: " + url)
        # Store the row that will be written to the csv file
        recipe_data_set.append([recipe_id, recipe_date, data_count[0], data_count[1], data_count[2], report, url])
        # Empty the list containing the stamp counts
        data_count.clear()
    # Follow the 1-second rule for scraping
    time.sleep(1)
```
Here, check the acquired data.
```python
for num in range(len(recipe_url_lists)):
    SearchRecipeInfo(recipe_url_lists[num], 'div', 'detailContents')
```
When you execute it, you can see that it has been acquired properly.
Next, output the data to a CSV file on Google Drive.
```python
# Mount Google Drive
drive.mount('/content/drive')
```
Choose any folder in Google Drive and specify the file name; replace "〇〇〇" with whatever name you like.
```python
# Create a folder on Google Drive and specify it as the save destination
save_dir = "./drive/My Drive/Colab Notebooks/〇〇〇/"

# Choose a file name
data_name = '〇〇〇.csv'

# Path of the csv file inside the folder
data_dir = save_dir + data_name

# Write the items to the csv file
with open(data_dir, 'w', newline='') as file:
    writer = csv.writer(file, lineterminator='\n')
    writer.writerow(['ID', 'Release Date', 'Repeat', 'Easy', 'Economy', 'Report', 'URL'])
    for num in range(len(recipe_url_lists)):
        writer.writerow(recipe_data_set[num])

# Read back the created file
with open(data_dir, 'r') as file:
    sheet_info = file.read()
```
When executed, 〇〇〇.csv will be output to the specified directory.
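As an aside, since the data ends up in pandas right after this anyway, the same file could also be written with `DataFrame.to_csv`. This is just an alternative sketch, not what the article uses:

```python
# Alternative sketch: write the same CSV with pandas instead of the csv module
columns = ['ID', 'Release Date', 'Repeat', 'Easy', 'Economy', 'Report', 'URL']
pd.DataFrame(recipe_data_set, columns=columns).to_csv(data_dir, index=False)
```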
Check the output CSV file with pandas.
```python
# Load the csv
rakuten_recipes = pd.read_csv(data_dir, encoding="UTF-8")
# Prepare to add columns
df = pd.DataFrame(rakuten_recipes)
df
```
The image of the output is omitted.
Next, calculate the number of days elapsed from the publication date of the recipe to today.
```python
# Extract the Release Date column from rakuten_recipes.csv
date = np.array(rakuten_recipes['Release Date'])

# Get the current date
today = datetime.date.today()

# Align the types
df['Release Date'] = pd.to_datetime(df['Release Date'], format='%Y-%m-%d')
today = pd.to_datetime(today, format='%Y-%m-%d')

df['Elapsed Days'] = today - df['Release Date']
# Keep only the number of elapsed days as an integer
df['Elapsed Days'] = df['Elapsed Days'].dt.days

# Check only the top 5 rows
df.head()
```
Then, the number of elapsed days will appear next to the URL column.
Next, use the number of elapsed days as a **weight**, and weight the three stamp types: Repeat, Easy, and Economy. Add the weighted values as new columns to the existing DataFrame.
```python
# Correction factor so that the values do not become too small
weighting = 1000000

# Extract the values of the 3 stamp types and the reports
repeat_stamp = np.array(rakuten_recipes['Repeat'])
easy_stamp = np.array(rakuten_recipes['Easy'])
economy_stamp = np.array(rakuten_recipes['Economy'])
report_stamp = np.array(rakuten_recipes['Report'])

# Totals of each stamp and of the reports
repeat_stamp_sum = sum(repeat_stamp)
easy_stamp_sum = sum(easy_stamp)
economy_stamp_sum = sum(economy_stamp)
report_stamp_sum = sum(report_stamp)

# Add columns of weighted values
'''
Repeat weight = (number of Repeat stamps ÷ total Repeat stamps) × (correction factor ÷ days elapsed since the release date)
'''
df['Repeat WT'] = (df['Repeat'] / repeat_stamp_sum) * (weighting / df['Elapsed Days'])
df['Easy WT'] = (df['Easy'] / easy_stamp_sum) * (weighting / df['Elapsed Days'])
df['Economy WT'] = (df['Economy'] / economy_stamp_sum) * (weighting / df['Elapsed Days'])

# Importance of the reports (range 0 to 1)
proportions_rate = 0.5

# Fold the report counts into the weighted columns
'''
Repeat weight = (Repeat weight × (1 - importance)) × ((number of reports ÷ total reports) × importance)
'''
df['Repeat WT'] = (df['Repeat WT'] * (1 - proportions_rate)) * ((df['Report'] / report_stamp_sum) * proportions_rate)
df['Easy WT'] = (df['Easy WT'] * (1 - proportions_rate)) * ((df['Report'] / report_stamp_sum) * proportions_rate)
df['Economy WT'] = (df['Economy WT'] * (1 - proportions_rate)) * ((df['Report'] / report_stamp_sum) * proportions_rate)
```
About the weighting: suppose one recipe was posted a month ago and another a year ago, and both have the same 100 stamps. The one from a month ago is the stronger recommendation, so recipes with more elapsed days get a lower score.
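For a rough feel of the effect, here is a tiny example with made-up numbers plugged into the same formula (100 Repeat stamps out of a hypothetical total of 1000):

```python
# Hypothetical numbers: the same 100 stamps, but different ages
weighting = 1000000
repeat_stamp_sum = 1000

one_month = (100 / repeat_stamp_sum) * (weighting / 30)    # ~30 days old
one_year  = (100 / repeat_stamp_sum) * (weighting / 365)   # ~365 days old

print(one_month)  # about 3333.3
print(one_year)   # about 274.0 -> the older recipe scores far lower
```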
Rescale the weighted values so that they range from 0 to 1 (min-max normalization). Here is the page I used as a reference.
QUANON: "Convert a number in one range to a number in another range"
```python
df['Repeat WT'] = (df['Repeat WT'] - np.min(df['Repeat WT'])) / (np.max(df['Repeat WT']) - np.min(df['Repeat WT']))
df['Easy WT'] = (df['Easy WT'] - np.min(df['Easy WT'])) / (np.max(df['Easy WT']) - np.min(df['Easy WT']))
df['Economy WT'] = (df['Economy WT'] - np.min(df['Economy WT'])) / (np.max(df['Economy WT']) - np.min(df['Economy WT']))

df.head()
```
This is the execution result. df.head() displays only the top 5 rows.
Next, take the user's scores on a 5-point scale and search the recipes with them.
```python
# Used to specify the range (1: 0-0.2, 2: 0.2-0.4, 3: 0.4-0.6, 4: 0.6-0.8, 5: 0.8-1)
condition_num = 0.2

def PlugInScore(repeat, easy, economy):
    # Clamp the arguments to the allowed range
    if 1 >= repeat:
        repeat = 1
    if 5 <= repeat:
        repeat = 5
    if 1 >= easy:
        easy = 1
    if 5 <= easy:
        easy = 5
    if 1 >= economy:
        economy = 1
    if 5 <= economy:
        economy = 5

    # Narrow down the recipes from the 3 scores
    df_result = df[((repeat*condition_num) - condition_num <= df['Repeat WT']) & (repeat*condition_num >= df['Repeat WT']) &
                   ((easy*condition_num) - condition_num <= df['Easy WT']) & (easy*condition_num >= df['Easy WT']) &
                   ((economy*condition_num) - condition_num <= df['Economy WT']) & (economy*condition_num >= df['Economy WT'])]
    # print(df_result)
    CsvOutput(df_result)
```
Output the search result to a CSV file. Replace 〇〇〇 with any name you like!
```python
# Choose a file name
data_name = '〇〇〇_result.csv'
# Path of the csv file inside the folder
data_dir_result = save_dir + data_name

# Output to csv
def CsvOutput(df_result):
    # Output the narrowed-down result to a csv file
    with open(data_dir_result, 'w', newline='') as file:
        writer = csv.writer(file, lineterminator='\n')
        # Header row (the column names)
        writer.writerow(df_result)
        # Each row of values
        for num in range(len(df_result)):
            writer.writerow(df_result.values[num])
    # Read back the created file
    with open(data_dir_result, 'r') as file:
        sheet_info = file.read()
    AdviceRecipe()
```
Declare a function to display the result.
```python
def AdviceRecipe():
    # Load the csv
    rakuten_recipes_result = pd.read_csv(data_dir_result, encoding="UTF-8")
    # Prepare as a DataFrame
    df_recipes_res = pd.DataFrame(rakuten_recipes_result)
    print(df_recipes_res)

    print('Recommended "' + food_name + '" for you')
    print("Entry No.1: " + df_recipes_res['URL'][random.randint(0, len(df_recipes_res) - 1)])
    print("Entry No.2: " + df_recipes_res['URL'][random.randint(0, len(df_recipes_res) - 1)])
    print("Entry No.3: " + df_recipes_res['URL'][random.randint(0, len(df_recipes_res) - 1)])
```
Finally, give a score to the recipe you want to make and display the recommendations.
```python
'''
Pass values to PlugInScore(repeat, easy, economy)

repeat : Do you want to make it again?
easy   : Is it easy to make?
economy: Can you make it cheaply?

Rate each subjectively on a 5-point scale from 1 to 5 and pass integers.
1 is negative, 5 is positive.
'''
PlugInScore(1, 1, 1)
```
The three scores are **1**: Do you want to make it again?, **1**: Is it easy to make?, **1**: Can you save money making it?
The execution result is ...
・ Reconsider the evaluation method: the three weighted values are pulled up by the highest-scoring recipe, which produces extreme results, so scores end up biased toward either 1 or 5.
・ Few recipes have stamps: roughly **10**% of recipes have stamps or reports, and roughly **90**% are all zeros, so scoring recipes by points may not make much sense. There are surely great recipes among the all-zero ones.
I used this system to search for "pork kimchi" and cooked one of the results. It was a recommended recipe, and it was delicious ^^
It was fun to discover recipes that had been buried.
Thank you to everyone who has read this far. I would be grateful if you could give us your comments and advice ^^