Purpose

Learn scraping with Python + Beautiful Soup.

Background

I wanted to download only the images from a website. I assumed that would be easy with scraping, but it turned out to be harder than expected, so for now I am writing down what I did.

Overview

Since the goal was to learn scraping, I did not write the part that downloads the images; that is left to Irvine, a free download manager. After the download, the script renames the image files to sequentially numbered file names and packs them into a Zip file. The flow is as follows.

Use the script to build a list of image URLs from the website and copy it to the clipboard
Paste the list into Irvine and download the images
Use the script to rename the files and compress them into a Zip file

Incidentally, please don't point out that Irvine alone could do all of this if you used its features properly. The purpose is to learn scraping.

Environment and settings

I did this on Windows 10. If you use Chocolatey, you can install Python 3 by starting cmd or Windows PowerShell with administrator privileges and running the following command.

> choco install python

If you are prompted along the way, answer y + Enter each time. After the installation completes, reopen cmd or PowerShell and run the following commands.

> pip install requests
> pip install bs4
> pip install pyperclip
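
To confirm that the three packages are visible to the Python you just installed, a small check like the following may help. This is only a sketch and not part of the article's script; the function name find_missing is made up for illustration. Note that the import name for Beautiful Soup is bs4.

```python
import importlib.util

# Import names of the three packages installed above.
REQUIRED_MODULES = ["requests", "bs4", "pyperclip"]


def find_missing(modules):
    """Return the module names that cannot be found in this Python environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

If anything is reported as not installed, rerunning the corresponding pip command in a freshly opened cmd or PowerShell window is the first thing to try.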

Download the Source code (zip) from Git and unzip it. Let the extracted path be "Git / traning /".

Download and install Irvine and start it. Create a new folder "folder01" under the default folder, and point it at the path that contains the script "HTML2imglist.py". You can change this later by choosing "Folder Settings" from the right-click context menu of "folder01". "folder01" has now been added to Irvine.

Select "Tools"-"Option Settings" from the menu, open the "Clipboard" tab, turn on the "Register directly from clipboard" check box, and click the OK button to close the dialog. You may also need to turn on "Manage"-"Clipboard Monitor" from the menu; in my environment it worked whether it was on or off.

How to use

The explanation uses a website that publishes open data from my local Ishikawa Prefecture as the target.

The thumbnail images of scenic spots on that site are the download target.

Start the command prompt (cmd) in the "Git / traning" path and move to "Git / traning / python / Web_scraping". Then run the script, passing the URL of the web page that contains the images you want to download as an argument.

> cd .\python\Web_scraping
> python Html2imglist.py https://www.hot-ishikawa.jp/photo/

The title and the list of image URLs are then copied to the clipboard. Start Irvine and paste the list into "folder01"; the download starts, so wait until it completes. Return to the command prompt and press any key: the downloaded image files are renamed to numbered file names and combined into a Zip file, "folder01.zip". Press any key again and the "folder01" folder is emptied.

Try dragging and dropping "folder01.zip" onto a viewer such as Image Viewer. It displayed without problems. In Image Viewer, the → key moves to the next image and the ← key to the previous one.
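
The numbering and Zip step can be sketched with the standard library alone. The code below is not taken from HTML2imglist.py; it is a minimal illustration under my own assumptions: the function name zip_with_numbered_names is made up, files are numbered in file-name order, and the numbered names are assigned while writing into the archive instead of renaming the files on disk first.

```python
import zipfile
from pathlib import Path


def zip_with_numbered_names(folder, zip_path):
    """Pack every file in `folder` into `zip_path` as 001.ext, 002.ext, ...

    Returns the list of names stored in the archive.
    """
    folder = Path(folder)
    # Collect the files before the archive is created, sorted by name
    # so that the numbering is reproducible.
    files = sorted(p for p in folder.iterdir() if p.is_file())
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for number, path in enumerate(files, start=1):
            # Keep the original extension (lower-cased), replace the stem.
            arcname = f"{number:03d}{path.suffix.lower()}"
            archive.write(path, arcname=arcname)
            names.append(arcname)
    return names
```

Calling zip_with_numbered_names("folder01", "folder01.zip") would produce an archive that viewers can page through in numeric order. Emptying the folder afterwards is left out here on purpose, since deleting files is the step most worth confirming by hand.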

Scraping

Display the page source of the target site, "Photo Material Download | Hot Ishikawa Travel Net", and look at the title tag. As a CSS selector, it is "html head title". Next, check the tag structure leading to the image files. The files to download appear in src attributes like the following.

Line 475:<img class="img-responsive" src="/photo/thumbnail/749/trim/1/1?v=0ca07195022078860363c009b75962f59c80bde5" alt="Kenrokuen">
～
Line 486:<img class="img-responsive" src="/photo/thumbnail/740/trim/1/1?v=f4145f658b274299f83a6038ef58f9b8d0cb5ac1" alt="Kanazawa Station">
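
Note that these src values are site-relative paths, while a downloader needs complete URLs. The article does not show how the script handles this, so the following is only a sketch of one common approach, using urljoin from the standard library; the helper name to_absolute is made up for illustration.

```python
from urllib.parse import urljoin

# The page that was passed to the script in "How to use".
PAGE_URL = "https://www.hot-ishikawa.jp/photo/"


def to_absolute(page_url, src):
    """Resolve a src attribute value against the URL of the page it came from."""
    return urljoin(page_url, src)


if __name__ == "__main__":
    src = "/photo/thumbnail/749/trim/1/1?v=0ca07195022078860363c009b75962f59c80bde5"
    print(to_absolute(PAGE_URL, src))
```

Because the src starts with "/", it is resolved against the site root, and the query string after "?" is kept as it is.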

The nesting of tags down to these img elements is as follows.

<html>
    <body>
        ～
        <div class="photoItems">
            <ul>
                <li>
                    <div class="photoItem">
                        <a>
                            <img src="1st target image">
                        </a>
                    </div>
                </li>
                <li>
                    <div class="photoItem">
                        <a>
                            <img src="2nd target image">
                        </a>
                    </div>
                </li>

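Written out in full as a CSS selector, this is "html body div.photoItems ul li div.photoItem a img". There is no space between div and the class name here, because a space in a selector means "descendant of". The script uses the shorter "html body div .photoItem img", which matches any img inside an element of class photoItem that itself sits inside some div; in the structure above, that is the same set of images.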
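
The spacing inside these selectors matters: "div .photoItem" (with a space) means "an element of class photoItem somewhere inside a div", while "div.photoItem" means "a div that has class photoItem". The sketch below checks a few variants with Beautiful Soup's select() against a cut-down copy of the structure shown above. The sample HTML and the helper name count are made up for illustration, and the real page may contain extra wrapper elements that change the result.

```python
from bs4 import BeautifulSoup

# A reduced, hypothetical copy of the tag structure sketched above.
SAMPLE_HTML = """
<html><head><title>Photo</title></head><body>
<div class="photoItems"><ul>
  <li><div class="photoItem"><a><img src="/a.jpg"></a></div></li>
  <li><div class="photoItem"><a><img src="/b.jpg"></a></div></li>
</ul></div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def count(selector):
    """Return how many elements in the sample match the CSS selector."""
    return len(soup.select(selector))


if __name__ == "__main__":
    for selector in (
        "html body div .photoItem img",
        "html body div.photoItems ul li div.photoItem a img",
        "html body div .photoItems ul li div .photoItem a img",
    ):
        print(count(selector), selector)
```

In this sample the first two selectors find both images. The third finds none, because the element with class photoItems is not nested inside another div here.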

Because the markup around the image files differs from site to site, these selectors can be set with the following variables in the HTML2imglist.py file.

Line 50: title_css_select = 'html head title'
Line 51: img_css_select = 'html body div .photoItem img'
Line 52: img_attr = 'src'
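
How three variables like these might drive the extraction can be sketched as follows. This is not the actual content of HTML2imglist.py, which the article does not show; the function name extract_title_and_urls is made up, and resolving relative paths with urljoin is my assumption. Fetching and clipboard handling are only indicated in comments so that the example runs without network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# The three settings quoted from the article.
title_css_select = 'html head title'
img_css_select = 'html body div .photoItem img'
img_attr = 'src'


def extract_title_and_urls(html, page_url):
    """Return (page title, list of absolute image URLs) for one HTML page."""
    soup = BeautifulSoup(html, "html.parser")

    # select_one() returns the first match, or None if nothing matches.
    title_tag = soup.select_one(title_css_select)
    title = title_tag.get_text(strip=True) if title_tag else ""

    urls = []
    for img in soup.select(img_css_select):
        value = img.get(img_attr)  # None when the attribute is missing
        if value:
            urls.append(urljoin(page_url, value))
    return title, urls


# In a real run, the HTML would presumably come from something like
#   html = requests.get(page_url).text
# and the result could be placed on the clipboard with
#   pyperclip.copy("\n".join(urls))
```

Keeping the selectors and the attribute name in variables means that adapting the script to another site should mostly be a matter of changing these three lines.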

In conclusion

I came away thinking that writing the CSS selector is the scraping itself. Seen that way, this article has not really taught anything about scraping, but that is surely just my imagination.

Try scraping with Python + Beautiful Soup