Try scraping with Python + Beautiful Soup

Purpose

Learn scraping with Python + Beautiful Soup.

Background

I wanted to download only the images from a website and assumed that scraping would make this easy. It turned out to be unexpectedly difficult, so for now I am writing down what I did.

Overview

Because the goal is to learn scraping, I did not write the part that downloads the images; the free download manager Irvine handles that. The script also renames the downloaded image files to sequentially numbered names and combines them into a zip file. The flow is as follows.

  1. Create a list of image URLs from the website with the script and copy it to the clipboard
  2. Paste the list into Irvine and download
  3. Rename the files with the script and compress them into a zip file

By the way, please don't point out that Irvine's own features could do all of this without writing anything. The purpose is to learn scraping.

Environment and settings

This was done on Windows 10. If you use Chocolatey, install Python 3 by starting cmd or Windows PowerShell with administrator privileges and running the following command.

> choco install python

If you are prompted along the way, answer y and press Enter each time. After the installation completes, reopen cmd or PowerShell and run the following commands.

> pip install requests
> pip install bs4
> pip install pyperclip
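
To confirm that the three packages are visible to the interpreter, a small check like the following may help. This is a minimal sketch and not part of the original tool; the helper name `missing_packages` is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the names that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the packages installed above (bs4 provides Beautiful Soup).
REQUIRED = ["requests", "bs4", "pyperclip"]

missing = missing_packages(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are available.")
```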

Download the source code (zip) from Git and unzip it. Let the extracted path be "Git / transing /".

Download and install Irvine, then start it. Create a new folder "folder01" under the default folder. Set the newly created "folder01" to the path where the script "HTML2imglist.py" is located. You can change this later by selecting "Folder Settings" from the right-click context menu of "folder01". "folder01" has now been added to Irvine.

Select "Tools" - "Option Settings" from the menu, open the "Clipboard" tab, and turn on the check box "Register directly from clipboard". Click the OK button to close the dialog. You may also need to turn on "Manage" - "Clipboard Monitor" from the menu. In my environment, it worked whether this was on or off.

How to use

This explanation uses a website that publishes open data for Ishikawa Prefecture, my local area.

The thumbnail images of scenic spots on that page are the target.

Start the command prompt (cmd) in the "Git / traning" path, then move to "Git / traning / python / Web_scraping". Run the script with the URL of the web page that contains the images you want to download as its argument.

> cd .\python\Web_scraping
> python Html2imglist.py https://www.hot-ishikawa.jp/photo/

The title and the list of image URLs are then copied to the clipboard. Start Irvine and paste the list into "folder01"; the download starts, so wait until it completes.

Return to the command prompt and press any key. The downloaded image files are renamed to sequentially numbered file names and combined into a zip file, "folder01.zip". Press any key again and the "folder01" folder is emptied.

Try dragging and dropping "folder01.zip" onto viewer software such as Image Viewer. The images were displayed without problems. Image Viewer moves to the next image with the → key and to the previous image with the ← key.
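
The rename-and-zip step can be sketched with the standard library alone. This is an illustration under assumptions, not the code of the actual script: the function names are hypothetical, files are numbered in file-name order, and three-digit numbers are used.

```python
import zipfile
from pathlib import Path


def rename_to_numbered(folder, digits=3):
    """Rename every file in `folder` to 001.ext, 002.ext, ... (sorted by name)."""
    folder = Path(folder)
    files = sorted(p for p in folder.iterdir() if p.is_file())

    # First pass: move to temporary names so that a new numbered name
    # can never collide with a file that has not been renamed yet.
    temporary = []
    for index, path in enumerate(files):
        temp = path.with_name(f"__tmp_{index}{path.suffix}")
        path.rename(temp)
        temporary.append(temp)

    # Second pass: assign the final numbered names.
    renamed = []
    for number, temp in enumerate(temporary, start=1):
        target = temp.with_name(f"{number:0{digits}d}{temp.suffix}")
        temp.rename(target)
        renamed.append(target)
    return renamed


def zip_folder(folder):
    """Write all files in `folder` to <folder>.zip next to it and return the path."""
    folder = Path(folder)
    zip_path = folder.parent / (folder.name + ".zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.iterdir()):
            if path.is_file():
                archive.write(path, arcname=path.name)
    return zip_path
```

Calling `rename_to_numbered("folder01")` followed by `zip_folder("folder01")` would produce "folder01.zip" beside the folder. Numbering by file name is only one possible order; numbering by download time would need the files' modification times instead.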

Scraping

Display the source code of the target site "Photo Material Download | Hot Ishikawa Travel Net" and check the title tag. Expressed as a CSS selector, this is "html head title". Next, check the tag structure leading to the image files. The files to download are referenced by src attributes like the following.

Line 475:<img class="img-responsive" src="/photo/thumbnail/749/trim/1/1?v=0ca07195022078860363c009b75962f59c80bde5" alt="Kenrokuen">
~
Line 486:<img class="img-responsive" src="/photo/thumbnail/740/trim/1/1?v=f4145f658b274299f83a6038ef58f9b8d0cb5ac1" alt="Kanazawa Station">

The nesting of tags down to these img elements is as follows.

<html>
    <body>
        ~
        <div class="photoItems">
            <ul>
                <li>
                    <div class="photoItem">
                        <a>
                            <img src="1st target image">
                        </a>
                    </div>
                </li>
                <li>
                    <div class="photoItem">
                        <a>
                            <img src="2nd target image">
                        </a>
                    </div>
                </li>

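Expressed as a CSS selector, this path is "html body div.photoItems ul li div.photoItem a img". There is no space between the tag name and the class, because the class belongs to the div itself; "div .photoItem" with a space means an element of class photoItem nested somewhere inside a div. The selector can be abbreviated, and the script uses "html body div .photoItem img", which matches any img inside a photoItem element that is itself inside some div.

Because the way images are marked up differs from website to website, the selectors can be set with the following variables in the HTML2imglist.py file.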
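
How these selectors behave can be tried on a small stand-in document with Beautiful Soup. The following is a sketch under assumptions: the sample HTML only mirrors the structure shown above and is not the real page, and the image paths are made up.

```python
from bs4 import BeautifulSoup

# Stand-in HTML that mirrors the structure shown above (not the real page).
SAMPLE_HTML = """
<html>
  <head><title>Sample photo page</title></head>
  <body>
    <div class="photoItems">
      <ul>
        <li><div class="photoItem"><a><img src="/photo/1.jpg" alt="first"></a></div></li>
        <li><div class="photoItem"><a><img src="/photo/2.jpg" alt="second"></a></div></li>
      </ul>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The title via the selector "html head title".
title = soup.select_one("html head title").get_text(strip=True)

# Full path: "div.photoItem" (no space) means a div that has the class itself.
full = [img["src"] for img in soup.select(
    "html body div.photoItems ul li div.photoItem a img")]

# Abbreviated form: any img inside a .photoItem element nested in some div.
short = [img["src"] for img in soup.select("html body div .photoItem img")]

# With a space, "div .photoItems" asks for a .photoItems element INSIDE a div.
# In this sample nothing wraps div.photoItems, so this selector matches nothing.
spaced = soup.select("html body div .photoItems ul li div .photoItem a img")

print(title)
print(full)
print(short)
print(len(spaced))
```

On this sample the full and abbreviated selectors pick the same two images. The abbreviated form works because div.photoItems already serves as the enclosing div.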

Line 50: title_css_select = 'html head title'
Line 51: img_css_select = 'html body div .photoItem img'
Line 52: img_attr = 'src'
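
As a rough illustration of how such variables might drive the extraction, the sketch below builds the title and an absolute URL list from an HTML string. The function names `build_url_list` and `fetch_and_copy` are hypothetical, and the clipboard layout (one item per line) is an assumption; the actual script may differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Same values as the variables in HTML2imglist.py shown above.
title_css_select = 'html head title'
img_css_select = 'html body div .photoItem img'
img_attr = 'src'


def build_url_list(html, base_url):
    """Return (title, absolute image URLs) extracted from an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.select_one(title_css_select)
    title = title_tag.get_text(strip=True) if title_tag else ""
    urls = []
    for img in soup.select(img_css_select):
        value = img.get(img_attr)
        if value:
            # src values such as "/photo/thumbnail/..." are relative to the site.
            urls.append(urljoin(base_url, value))
    return title, urls


def fetch_and_copy(page_url):
    """Download the page and copy the title plus URL list to the clipboard."""
    # Imported here so build_url_list can be tried without these packages.
    import pyperclip
    import requests

    response = requests.get(page_url, timeout=30)
    response.raise_for_status()
    title, urls = build_url_list(response.text, page_url)
    # One item per line is an assumed layout, not necessarily the script's.
    pyperclip.copy("\n".join([title] + urls))
    return title, urls
```

Keeping the parsing in `build_url_list` separate from the network and clipboard calls makes it possible to try different selectors on saved HTML before running against the live site.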

In conclusion

I came away feeling that writing CSS selectors is scraping itself. Seen that way, this article hardly teaches anything about scraping, but please overlook that.

Related Links

Targeted site - Download Stock Photos | Hot Ishikawa Travel Net
