Introduction

I've been looking into various things in order to try web scraping with Python, so I'm summarizing what I found here as a memo.

Library to use

Use Beautiful Soup and Requests

--Reference article: [Python] Get the text from the URL with Beautiful Soup and Requests

How to check the output of Python code in Atom

Use a package called "script"

--Reference article: [Explanation] How to execute Python on Atom

The shortcut for running code is Ctrl + Shift + B (on Windows)

Notes on Beautiful Soup

--There seem to be several parsers to choose from
  --Reference article: Introduction to beautifulsoup4 parse and scrape html
  --Honestly, I don't really understand the differences (apparently speed is one of them)
  --Most sample code uses html.parser, so I assume that is fine for now lol

--Use get_text() to get the text
  --Reference article: Beautiful Soup in 10 minutes
  --It can also strip line breaks and whitespace, be called on a specific element, and so on
  --You can get only part of the text by going through an element first
  --For example, .i.get_text() returns only the text enclosed by the first <i> element
  --Reference: Official Reference
--Use get("href") to get the href attribute
  --Reference article: [Python] Get href value with Beautiful Soup [Scraping]

Notes on requests

Reference article: How to use Requests (Python Library)

--There is a function for each HTTP request method
--The response body can be read either as text (.text) or as binary (.content)
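A rough sketch of those two points follows. The helper names fetch and body_forms are made up for this memo, and the URL in the usage comment is only a placeholder.

```python
import requests


def fetch(url, timeout=10):
    """GET a page and raise an exception for 4xx/5xx responses."""
    # requests.post(), requests.put(), etc. exist for the other HTTP methods
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp


def body_forms(resp):
    """Return the same response body as (raw bytes, decoded str)."""
    return resp.content, resp.text


# Usage (needs network access, so it is left as a comment):
# raw, text = body_forms(fetch("https://example.com/"))
# print(type(raw), type(text))
```

Passing .content (bytes) to Beautiful Soup, as the sample code in this memo does, lets the parser work out the character encoding itself, whereas .text has already been decoded by Requests.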

About UnicodeEncodeError

I got an error when I tried to run the following sample code in a Windows environment

from bs4 import BeautifulSoup
import requests as req

url = 'https://www.y-shinno.com/vgg16-finetuning-uecfood100/'
html = req.get(url).content
soup = BeautifulSoup(html, 'html.parser')
text = soup.find(class_='entry-content').get_text()
print(text)

The error returned

UnicodeEncodeError: 'cp932' codec can't encode character '\xa0' in position 1710: illegal multibyte sequence

When I looked into it, the cause seems to be a conflict between the encoding Python uses for strings internally (Unicode) and the default encoding of the Windows console (cp932).

--Reference article: (Windows) Causes and workarounds for UnicodeEncodeError in Python 3

I solved it by adding the code from the following article next to the imports (it took a long time to resolve ...)

--Reference article: Countermeasures when a Japanese encoding error occurs in Python

import io,sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

Maybe I should always include this when dealing with Japanese text in Python on Windows? (Please point it out if I'm wrong.)
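To see why the error happens, here is a minimal sketch that needs no network access. The helper names can_encode and use_utf8_stdout are made up for this memo, and the sample string is an assumption standing in for the scraped text. It checks whether a string can be encoded as cp932, and shows two possible alternatives to the workaround above: replacing the no-break space, or calling sys.stdout.reconfigure(), which is available from Python 3.7.

```python
import sys


def can_encode(text, encoding):
    """Return True if every character in text can be encoded with the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def use_utf8_stdout():
    """Switch print() output to UTF-8 (Python 3.7+); does nothing if unsupported."""
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")


# "\xa0" is a no-break space (&nbsp; in HTML), the character named in the error message
sample = "price:\xa0100"

cp932_ok = can_encode(sample, "cp932")  # cp932 is the Japanese Windows code page
utf8_ok = can_encode(sample, "utf-8")

# Option 1: replace the no-break space with a normal space before printing
cleaned = sample.replace("\xa0", " ")

# Option 2: call use_utf8_stdout() once at startup instead of wrapping sys.stdout by hand
```

Which option fits best depends on where the output goes; if the terminal itself cannot display UTF-8, replacing the offending characters may be the simpler route.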

How to get and scrape elements

You can use find() or select()

--Reference articles
  --Basics of CSS Selector for Web Scraping
  --Differences in how to use find_all() and select() in Beautiful Soup

Note: When you look up a class attribute with the developer tools, copying the attribute value from the source and pasting it straight into a selector does not always work.

Example

<div class="l-mt10 text-ellipsis__body--2lines contents-list__item-name">AIUEO</div>

If you try to select it by passing this class attribute value as is

soup.select(".l-mt10 text-ellipsis__body--2lines contents-list__item-name")

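Then nothing is matched, because a space in a CSS selector means "descendant of" and the words after it are read as tag names rather than classes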

soup.select(".l-mt10.text-ellipsis__body--2lines.contents-list__item-name")

It has to be written like this, with every class prefixed by a dot and no spaces in between (difficult to notice ...)
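The difference can be checked with the example element above. This is only a sketch; the find_all() line is an extra alternative, not something the reference articles prescribe.

```python
from bs4 import BeautifulSoup

html = '<div class="l-mt10 text-ellipsis__body--2lines contents-list__item-name">AIUEO</div>'
soup = BeautifulSoup(html, "html.parser")

# Spaces are descendant combinators, so this looks for nested tags and matches nothing
wrong = soup.select(".l-mt10 text-ellipsis__body--2lines contents-list__item-name")

# Chaining the classes with dots matches one element that has all three classes
right = soup.select(".l-mt10.text-ellipsis__body--2lines.contents-list__item-name")

# A single distinctive class is often enough
one_class = soup.select(".contents-list__item-name")

# find_all(class_=...) also matches on any one of the element's classes
via_find = soup.find_all(class_="contents-list__item-name")

print(len(wrong), len(right), len(one_class), len(via_find))
```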

How to extract only the text from the list returned by select

Combine get_text() with for ... in ... inside a list comprehension (so this is how a for statement can be used!)

Example

print([t.get_text() for t in text])

Written like this, get_text() is called on each element of the list in turn, and the results are collected into a new list.
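For comparison, here is the same idea written both as an ordinary for loop and as a list comprehension. The HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, just for illustration
html = "<ul><li class='name'>apple</li><li class='name'> banana </li><li class='name'>cherry</li></ul>"
soup = BeautifulSoup(html, "html.parser")
elements = soup.select("li.name")  # a list-like result of Tag objects

# Ordinary for loop
names_loop = []
for el in elements:
    names_loop.append(el.get_text(strip=True))

# Equivalent list comprehension
names = [el.get_text(strip=True) for el in elements]

print(names)
```

Both forms produce the same list; the comprehension is just the shorter way to write it.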

When dealing with JSON in python

If you want to output the scraped elements as a JSON array, use the json module together with itemgetter from the operator module, and loop over the elements with a for statement to build the array one entry at a time.

About json module

Reference article:

--Explanation of how to handle JSON in Python
--Let's master json dumps in Python! encoding, foramt, datetime

The distinction between functions is confusing ...

--json.load(): reads JSON from a file and returns it as a Python object (a dictionary for a JSON object)
--json.loads(): converts JSON held as a string in the program into a Python object
--json.dump(): converts a value such as a dictionary to JSON and writes it to a file
--json.dumps(): converts a value such as a dictionary to a JSON string and returns it
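One way to keep the four functions apart: the ones ending in "s" work with strings, the others with file objects. The sketch below uses made-up sample data, and io.StringIO stands in for a real file.

```python
import io
import json

# Made-up sample data
data = {"title": "スクレイピング", "tags": ["python", "bs4"], "count": 2}

# dumps / loads: Python object <-> JSON string
text = json.dumps(data, ensure_ascii=False, indent=2)  # ensure_ascii=False keeps Japanese readable
restored = json.loads(text)

# dump / load: Python object <-> file object (an in-memory file here)
buffer = io.StringIO()
json.dump(data, buffer, ensure_ascii=False)
buffer.seek(0)  # rewind before reading back
from_file = json.load(buffer)

print(text)
```

With a real file, open("data.json", "w", encoding="utf-8") would take the place of the StringIO object.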

About item getter of operator module

I only skimmed this, so I'm just listing the reference material.

--Reference: Active engineers explain how to use itemgetter () in Python [for beginners]
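As a minimal sketch of what itemgetter does and how it could be combined with the json module, see below. The rows are made-up sample data, not results from a real scrape.

```python
import json
from operator import itemgetter

# Made-up rows, standing in for scraped results
rows = [
    {"name": "banana", "price": 120},
    {"name": "apple", "price": 300},
    {"name": "cherry", "price": 80},
]

# itemgetter("name") builds a function equivalent to: lambda row: row["name"]
get_name = itemgetter("name")
names = [get_name(row) for row in rows]

# It is often used as a sort key
by_price = sorted(rows, key=itemgetter("price"))

# Collect one field from every row and output it as a JSON array
json_array = json.dumps([get_name(row) for row in by_price])

print(json_array)
```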

How to handle arrays in python