This article is based on the content of the following video previously published on YouTube.
The content covered in this article is for learning purposes only. In some cases, scraping may lead to a load on the other server, so please do it within the bounds of common sense. </ font>
First of all, I will paste the actual code written in the video below.
!pip install gspread
I had a habit of explicitly typing gspread
to install it, but I didn't need it ...
Even if I installed it this way, it didn't make sense because it was the same as the version originally installed in Colab.
For more information on this, please refer to the following article posted earlier. https://qiita.com/safa/items/bfa52430f920ac562bec#gspread%E3%81%AE%E3%82%A4%E3%83%B3%E3%82%B9%E3%83%88%E3%83%BC%E3%83%AB
from google.colab import auth
from oauth2client.client import GoogleCredentials
import gspread
auth.authenticate_user()
gc = gspread.authorize(GoogleCredentials.get_application_default())
Code for accessing Google Sheets on Google Colab
.
In the middle of myself, it has become not the same
.
If you execute this code, the URL will be displayed. If you follow the instructions on the screen, the authentication itself will be completed without any problems. (It is easier if you log in with your own Google account in advance)
Here is the actual Python code. Please enter the Google Spreadsheet URL that you want to save to.
from bs4 import BeautifulSoup
import requests
workbook_url = "URL of Google Spreadsheet to save to"
workbook = gc.open_by_url(workbook_url)
mercari_url = "https://www.mercari.com"
fetch_path = "/jp/category/967/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4222.0 Safari/537.36'}
# fetch_url: https://www.mercari.com/jp/category/967/
fetch_url = mercari_url + fetch_path
print("Acquisition URL: " + fetch_url)
r = requests.get(fetch_url, headers=headers)
soup = BeautifulSoup(r.text, "lxml")
title = soup.find('title').get_text()
worksheet = workbook.add_worksheet(title=title, rows=100, cols=4)
item_list = soup.find_all("li", class_="sc-bwzfXH")
result_list = []
worksheet.append_row(["ID", "title", "price", "Detailed URL", "image"])
for i, item in enumerate(item_list):
item_title = item.find("span").get_text()
item_price = item.find("div", class_="style_thumbnail__N_xAi").get_text()
item_url = mercari_url + item.find("a")["href"]
image = "=IMAGE(\"" + item.find("img")["src"] + "\")"
worksheet.append_row([i, item_title, item_price, item_url, image], value_input_option="USER_ENTERED")
print("Done!")
If the execution is successful, the data will be written with the image displayed in the Google spreadsheet as shown below.
I will give some supplementary explanations about the code I wrote.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4222.0 Safari/537.36'}
・
・
・
r = requests.get(fetch_url, headers=headers)
The Mercari page seems to return 403
for access without ʻUser-Agent. In requests, it is possible to set ʻUser-Agent
as an option in the second argument of get
, so access is performed after setting this.
The content of ʻUser-Agent is that of
Google Chrome Canarywhen creating the video. The value displayed in the
Network tab of
Devtools` is used as it is.
soup = BeautifulSoup(r.text, "lxml")
Since lxml
is installed from the beginning on Google Colab
, writing such code suddenly does not cause an error.
If you want to run the same code locally, you also need to have pip installed on lxml
.
title = soup.find('title').get_text()
worksheet = workbook.add_worksheet(title=title, rows=100, cols=4)
When I execute the Python script, I try to create a sheet with the title of the acquired page as the sheet name. In the ʻadd_worksheet` function, the height and width of the sheet to be created are specified, but it seems that they will be added without permission even if they are not specified. (For the time being, it is specified in 100 rows and 4 columns)
image = "=IMAGE(\"" + item.find("img")["src"] + "\")"
worksheet.append_row([i, item_title, item_price, item_url, image], value_input_option="USER_ENTERED")
To display an image in Google Sheets, use the IMAGE function provided by Google Sheets.
Looking at the docs, it seemed that some options were offered, but I didn't use them this time. I really wanted to resize the cell to fit the image, but unfortunately that feature wasn't provided, so I'm just passing the image URL.
There is no mode to resize cells to fit the image.
Quoted from the help page above (I wanted to use this feature ...)
Even if you pass the image URL as it is, the append_row
function of gspread
will be '= IMAGE
and will be interpreted as a character string.
Therefore, value_input_option =" USER_ENTERED "
is specified as an option.
By specifying this, the behavior will be the same as when the user types the character on the screen on Google Sheets.
In this case, when the user types in the value = IMAGE ("...")
as it is, it will be interpreted in the same way, so the spreadsheet will display the image as intended.
■ Reference document API Reference - gspread(append_row) ValueInputOption | Sheet API
That's it.
I occasionally upload Python related videos to YouTube. Please take a look if you like!
Learn Python with Safa's Colab
Recommended Posts