I tried to automatically collect images of Kanna Hashimoto with Python!

Scraping

Scraping is a technique for automatically extracting information from web pages. This time, images of Kanna Hashimoto are collected automatically from a search engine's image search results page.
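To make the idea concrete, here is a minimal sketch of pulling image URLs out of HTML. It uses only the standard library's `html.parser` instead of the BeautifulSoup library used in the article's code, so it runs without extra installs. The `ImgSrcCollector` class, the `extract_img_srcs` function, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


def extract_img_srcs(html):
    """Return the src values of all <img> tags in the given HTML string."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.srcs


# Hypothetical page fragment; a real search result page is far larger.
SAMPLE_HTML = """
<html><body>
  <p class="tb"><img src="https://example.com/a.jpg" alt="a"></p>
  <p class="tb"><img src="https://example.com/b.jpg" alt="b"></p>
  <p>No image here</p>
</body></html>
"""

print(extract_img_srcs(SAMPLE_HTML))
```

With a real page, the HTML string would come from an HTTP response rather than a constant.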

What to implement

  1. Access the URL of image search results
  2. Pagination
  3. Get the URL list of images
  4. Download
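Steps 1 and 2 come down to building one URL per result page. The sketch below shows one way to do that with `urllib.parse.urlencode`, which also takes care of percent-encoding the Japanese keyword. The endpoint, the `p`, `ei`, and `b` parameters, and the page size of 20 are taken from the article's source code; the function name `build_page_urls` is hypothetical, and whether the site still accepts these parameters is an assumption.

```python
from urllib.parse import urlencode

# Endpoint taken from the article's source code
BASE_URL = "https://search.yahoo.co.jp/image/search"


def build_page_urls(keyword, max_page_num, img_num_per_page=20):
    """Return one search URL per result page.

    The b parameter is the 1-based index of the first image on the page,
    so page 0 starts at 1, page 1 at 21, and so on.
    """
    urls = []
    for i in range(max_page_num):
        query = urlencode({
            "p": keyword,
            "ei": "UTF-8",
            "b": i * img_num_per_page + 1,
        })
        urls.append(f"{BASE_URL}?{query}")
    return urls


for page_url in build_page_urls("橋本環奈", 3):
    print(page_url)
```

Building the query from a keyword avoids pasting a pre-encoded URL by hand, which makes it easier to search for a different person.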

The details are explained in the following video: https://youtu.be/gqzC0jHdpgw

English version: https://youtu.be/XKrDqGPSfVw

Source code

scraping.py


import os
import time
import urllib.request

import requests
from bs4 import BeautifulSoup


def scraping(url, max_page_num):
    # Pagination: build the URL of each result page
    page_list = get_page_list(url, max_page_num)
    # Collect the image URLs from every page
    all_img_src_list = []
    for page in page_list:
        img_src_list = get_img_src_list(page)
        all_img_src_list.extend(img_src_list)
    return all_img_src_list


def get_img_src_list(url):
    # Access the search results page
    response = requests.get(url)
    response.raise_for_status()
    # Parse the response HTML
    soup = BeautifulSoup(response.text, 'html.parser')
    # Selector from the original article; it depends on the page's current markup
    # Skip <img> tags that have no src attribute
    img_src_list = [img.get('src') for img in soup.select('p.tb img') if img.get('src')]
    return img_src_list


def get_page_list(url, max_page_num):
    # Each result page shows 20 images; the b parameter is the index of the first one
    img_num_per_page = 20
    page_list = [f'{url}{i*img_num_per_page+1}' for i in range(max_page_num)]
    return page_list


def download_img(src, dist_path):
    # Wait one second between requests to avoid overloading the server
    time.sleep(1)
    with urllib.request.urlopen(src) as data:
        img = data.read()
        with open(dist_path, 'wb') as f:
            f.write(img)


def main():
    # The p parameter is the percent-encoded search keyword
    url = "https://search.yahoo.co.jp/image/search?p=%E6%A9%8B%E6%9C%AC%E7%92%B0%E5%A5%88&ei=UTF-8&b="
    MAX_PAGE_NUM = 1
    all_img_src_list = scraping(url, MAX_PAGE_NUM)

    # Create the output directory if it does not exist yet
    os.makedirs('./img', exist_ok=True)

    # Image download
    for i, src in enumerate(all_img_src_list):
        download_img(src, f'./img/kanna_{i}.jpg')


if __name__ == '__main__':
    main()
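The script above saves every collected `src` as a `.jpg` file. As a possible refinement, the sketch below filters the list before downloading: it drops empty values, duplicates, and anything that is not an `http` or `https` URL, and derives the file extension from the URL path. The function name `select_download_targets` and its parameters are hypothetical, and falling back to `.jpg` when the URL has no extension is an assumption, not something the article specifies.

```python
import os
from urllib.parse import urlparse


def select_download_targets(src_list, out_dir="./img", prefix="kanna"):
    """Return (url, path) pairs for unique http(s) URLs, in first-seen order."""
    targets = []
    seen = set()
    for src in src_list:
        # Skip missing values and URLs that were already selected
        if not src or src in seen:
            continue
        parsed = urlparse(src)
        # Skip data: URIs, relative paths, and other non-HTTP sources
        if parsed.scheme not in ("http", "https"):
            continue
        seen.add(src)
        # Use the extension from the URL path; assume .jpg when there is none
        ext = os.path.splitext(parsed.path)[1] or ".jpg"
        targets.append((src, f"{out_dir}/{prefix}_{len(targets)}{ext}"))
    return targets


# Hypothetical input, similar in shape to all_img_src_list
sample = [
    "https://example.com/a.png",
    None,
    "data:image/gif;base64,AAAA",
    "https://example.com/a.png",
    "https://example.com/thumb?id=1",
]
for url, path in select_download_targets(sample):
    print(url, "->", path)
```

Each returned pair could then be passed to `download_img(url, path)` in place of the `enumerate` loop in `main`.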
