This time, I tried scraping images of "Suzu Hirose" using Google's image search. If you do image processing yourself, you will need image data, and I hope this article is useful as one way to acquire it.
Getting images from Google's image search requires scrolling the page so that more results load. Beautiful Soup cannot scroll a page, so I use Selenium for that part.
First, import everything we need.
from selenium import webdriver
from time import sleep
from bs4 import BeautifulSoup
import requests
import base64
import os
import re
import shutil
You will need ChromeDriver to use Selenium. Download it from the "ChromeDriver - WebDriver for Chrome" page listed in the reference materials.
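# Open Google
# Specify the path where the driver is located.
# (Selenium 3 style; Selenium 4 takes the path through selenium.webdriver.chrome.service.Service instead.)
driver = webdriver.Chrome("C:\\Users\\chromedriver")
driver.get("https://www.google.com/")
sleep(2)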
Next, locate the search bar. Use the developer tools (Inspect) in the Chrome window that Selenium opened to identify the element. I inspected the page in my regular Chrome install instead and wrote the locator from that, which gave me an error. It took me about an hour to track down the cause.
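# find_element_by_name was removed in Selenium 4.3; find_element("name", ...) works in Selenium 3 and 4.
search_bar = driver.find_element("name", "q")
# Enter the keyword in the search bar
search_bar.send_keys("Hirose Suzu")
search_bar.submit()
sleep(2)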
If all goes well, "Hirose Suzu" is typed into the search bar and the search runs.
Then move to the image results.
# Move to the image results
# The class name reflects Google's markup at the time of writing and may have changed since.
img_btn = driver.find_element("xpath", '//a[@class="q qs"]')
img_btn.click()
This takes us to the image results page, which is where we will collect the images.
First, get the image URLs. I use BeautifulSoup to find the img tags and read the URL from each one. Most image URLs are stored in the data-src attribute of the img tag, but some tags have no data-src, and in that case I fall back to src.
# Collected image URLs. This list will contain duplicates.
all_images = []
error_flag = False
# Scroll the screen.
try:
    # Scroll 5 times
    for i in range(5):
        # Scroll to the bottom of the page.
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
        # Load the current page source into Beautiful Soup.
        soup = BeautifulSoup(driver.page_source , "html.parser")
        # Append the image URLs to all_images
        for image in soup.find_all("img"):
            try:
                url = image.get("data-src")
                if url is None:
                    url = image.get("src")
                if url is not None:
                    all_images.append(url)
            except Exception:
                print("An error occurred when getting the image URL.")
        sleep(2)
except Exception:
    print("An error occurred while scrolling the screen.")
    error_flag = True
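The "data-src first, src as the fallback" rule can be tried on a static HTML string without opening a browser. The following is only a minimal sketch: it uses the standard library's html.parser instead of Beautiful Soup so that it runs without extra installs, the names ImgUrlCollector and collect_img_urls are made up for this example, and the sample markup is hypothetical. Unlike the loop above, it also skips empty attribute values.

```python
from html.parser import HTMLParser


class ImgUrlCollector(HTMLParser):
    """Collect one URL per <img> tag, preferring data-src over src."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # data-src first, src as the fallback; skip tags that have neither.
        url = attr_map.get("data-src") or attr_map.get("src")
        if url:
            self.urls.append(url)


def collect_img_urls(html):
    """Return the image URLs found in an HTML string, in document order."""
    collector = ImgUrlCollector()
    collector.feed(html)
    return collector.urls


# Hypothetical markup for illustration only.
sample_html = """
<img data-src="https://example.com/a.jpg" src="data:image/gif;base64,AAAA">
<img src="https://example.com/b.jpg">
<img alt="no url here">
"""
print(collect_img_urls(sample_html))
# ['https://example.com/a.jpg', 'https://example.com/b.jpg']
```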
As the comment in the code notes, all_images contains duplicate URLs, because the whole page is parsed again after every scroll. Remove the duplicates so that each URL appears only once.
all_images = list(dict.fromkeys(all_images))
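As a small illustration of why dict.fromkeys is used here: dictionaries keep insertion order (guaranteed since Python 3.7), so the first occurrence of each URL stays in its original position, whereas set() gives no ordering guarantee. The URLs below are made up for the example.

```python
# Made-up URLs with repeats, in the order they might have been collected.
urls = ["a.jpg", "b.jpg", "a.jpg", "c.jpg", "b.jpg"]

# Keys of a dict are unique and keep insertion order,
# so this drops repeats while preserving first-seen order.
unique_urls = list(dict.fromkeys(urls))
print(unique_urls)  # ['a.jpg', 'b.jpg', 'c.jpg']
```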
The collected URLs come in two forms: ordinary https URLs, and data URIs in which the image itself is embedded as base64. So two download paths are needed: (1) download over HTTP and (2) decode from base64. I wrote one function for each pattern.
# Save the image at the given http(s) URL.
def img_url_download(url , file_path):
    response = requests.get(url , stream = True)
    # Save to file
    with open(file_path , 'wb') as file:
        shutil.copyfileobj(response.raw , file)

# Save a base64-encoded image.
# Pass url with the "data:image/jpeg;base64," prefix already removed.
def base64_download(url , file_path):
    img = base64.b64decode(url.encode())
    with open(file_path , "wb") as f:
        f.write(img)
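Embedded images are not always JPEG, so a fixed "data:image/jpeg;base64," prefix can miss some of them. As a rough sketch (not part of the original script), the prefix can be matched with a regular expression that also captures the image subtype. The names DATA_URI_PATTERN and split_data_uri are hypothetical, and the bytes used below are made up rather than a real image.

```python
import base64
import re

# Matches prefixes such as "data:image/jpeg;base64," or "data:image/png;base64,".
DATA_URI_PATTERN = re.compile(r"^data:image/([a-zA-Z0-9.+-]+);base64,")


def split_data_uri(url):
    """Return (subtype, raw_bytes) for a base64 image data URI, or None otherwise."""
    match = DATA_URI_PATTERN.match(url)
    if match is None:
        return None
    # Everything after the prefix is the base64 payload.
    payload = url[match.end():]
    return match.group(1), base64.b64decode(payload)


# Round-trip check with made-up bytes rather than a real image.
fake_bytes = b"not really an image"
fake_uri = "data:image/png;base64," + base64.b64encode(fake_bytes).decode("ascii")

print(split_data_uri(fake_uri))                     # ('png', b'not really an image')
print(split_data_uri("https://example.com/a.jpg"))  # None
```

A None result means the URL is not an embedded image and should go through the HTTP download path instead. The returned subtype could also be used to pick the file extension.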
With the functions defined, the last step is to save the images to a folder.
# Write the image data to files!
# Destination folder
path = r"C:\Users\suzu_img_files"  # Please specify the path of the folder to save the images
# base64 images start with "data:image/jpeg;base64,", so strip that prefix first.
base64_string = "data:image/jpeg;base64,"
for index , image_url in enumerate(all_images):
    filename = "suzu_" + str(index) + ".jpg"
    file_path = os.path.join(path , filename)
    # Branch depending on whether the URL is base64 data or not.
    if image_url.startswith(base64_string):
        url = image_url.replace(base64_string , "")  # Strip the prefix from the URL.
        base64_download(url , file_path)
    else:
        img_url_download(image_url , file_path)
If all goes well, the images are saved in the folder you specified.
How was that? Using Selenium broadens the range of what you can scrape. This time the subject was Suzu Hirose, but you can just as well scrape people, animals, buildings, or anything else you like! Also, I started from the Google search screen this time partly as Selenium practice, but if you only want the images, it is faster to open the URL of the image results page directly as the first URL.
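The shortcut of opening the image results page directly might look like the sketch below. It assumes that Google still accepts the "tbm=isch" query parameter to select image results. That parameter is not taken from this article, so check it against the URL your own browser shows. The function name build_image_search_url is made up for this example.

```python
from urllib.parse import urlencode


def build_image_search_url(keyword):
    """Build a Google image search URL for the given keyword."""
    # Assumption: "tbm=isch" selects the image results tab.
    query = urlencode({"q": keyword, "tbm": "isch"})
    return "https://www.google.com/search?" + query


print(build_image_search_url("Hirose Suzu"))
# https://www.google.com/search?q=Hirose+Suzu&tbm=isch
```

Passing the result to driver.get() would replace the search bar and image tab steps, leaving only the scrolling and download parts.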
Reference materials
[Introduction to Python] Scraping images of Kanna Hashimoto. Examples of what Python can do: Download images. Exercises after Progate | Data analysis in Python. Beautiful Soup
ChromeDriver - WebDriver for Chrome
Python-based web scraping (BeautifulSoup, Selenium, Requests)