I've been looking into various ways to do web scraping with Python, and I'd like to summarize what I found as a memo.
Use Beautiful Soup and Requests
--Reference article: [Python] Get the text from the URL with Beautiful Soup and Requests
To run Python in Atom, use a package called "script"
--Reference article: [Explanation] How to execute Python on Atom
The shortcut to run code is Ctrl + Shift + B (on Windows)
--It seems that there are various types of parsers
--Reference article: Introduction to beautifulsoup4 parse and scrape html
--I honestly don't really understand the difference (apparently they differ in speed)
--I think html.parser is fine, since most sample code uses it (lol)
--Use get_text() to get the string
--Reference article: Beautiful Soup in 10 minutes
--You can also trim line breaks and whitespace, get elements by specifying the element name, etc.
--You can also get a specific string by specifying an element.
--For example, if you specify .i.get_text(), only the part enclosed by the <i> element is returned.
--Reference: Official Reference
--You can get the href attribute with get("href")
--Reference article: [Python] Get href value with Beautiful Soup [Scraping]
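The points above about get_text() and get("href") can be tried out on a small piece of HTML. This is only a minimal sketch: the HTML snippet and the variable names are made up for illustration, and html.parser is used because it is the parser used elsewhere in this memo.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for this example
html = """
<div class="entry-content">
  <p>Hello, <i>world</i>!</p>
  <a href="https://example.com/page">  a link  </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# All text inside the first <p>, including the text of nested tags
p_text = soup.p.get_text()

# Only the text enclosed by the first <i> element
i_text = soup.i.get_text()

# strip=True trims leading/trailing whitespace from each piece of text
link_text = soup.a.get_text(strip=True)

# Attribute values are read with get(); a missing attribute gives None
link_href = soup.a.get("href")
link_title = soup.a.get("title")

print(p_text)
print(i_text)
print(link_text)
print(link_href)
print(link_title)
```

Because get() returns None for a missing attribute instead of raising an error, it seems like a safe choice when not every link on a page is guaranteed to have the attribute.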
Reference article: How to use Requests (Python Library)
--There is a function corresponding to each HTTP request method
--The response content can be retrieved either as text (.text) or as binary data (.content).
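To see the difference between .text and .content without depending on a real website, the sketch below starts a tiny local HTTP server and fetches from it with Requests. The server, the handler class name DemoHandler, and the page body are all assumptions made up for this example; in real scraping you would simply pass the target URL to requests.get().

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

# Hypothetical page body, encoded as UTF-8 bytes
BODY = "<p>こんにちは</p>".encode("utf-8")


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny local server so the example does not depend on a real site."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local demo
res = session.get(url)     # GET request; post(), put(), delete() etc. also exist
session.close()

server.shutdown()
server.server_close()

as_bytes = res.content  # raw response body as bytes
as_text = res.text      # body decoded to str using the response encoding

print(type(as_bytes).__name__, len(as_bytes))
print(type(as_text).__name__, len(as_text))
```

The byte length is larger than the character length here because each Japanese character takes three bytes in UTF-8. Beautiful Soup accepts either form, which is presumably why sample code passes .content to it directly.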
I got an error when I tried to run the following sample code in a Windows environment:
from bs4 import BeautifulSoup
import requests as req
url = 'https://www.y-shinno.com/vgg16-finetuning-uecfood100/'
html = req.get(url).content
soup = BeautifulSoup(html, 'html.parser')
text = soup.find(class_='entry-content').get_text()
print(text)
The error returned
UnicodeEncodeError: 'cp932' codec can't encode character '\xa0' in position 1710: illegal multibyte sequence
When I investigated, it seems to be caused by a conflict between the string encoding Python uses internally and the standard encoding on Windows (cp932 here).
--Reference article: (Windows) Causes and workarounds for UnicodeEncodeError in Python 3
I solved it by adding the code from the following article along with the imports (it took a long time to resolve ...)
--Reference article: Countermeasures when a Japanese encoding error occurs in Python
import io,sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
Should I always include this when handling Japanese in Python on Windows? (Please point out any mistakes.)
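The cause of the error can be reproduced on any OS by encoding the problem character by hand. The following is a rough sketch, not the code from the reference article: the sample string and the helper name can_encode are made up, and an in-memory buffer stands in for sys.stdout.buffer so that the result can be inspected.

```python
import io

# Sample text containing a no-break space (U+00A0), the character from the error
text = "price:\xa0100"


def can_encode(s, encoding):
    """Return True if s can be encoded with the given encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


cp932_ok = can_encode(text, "cp932")  # the Windows console encoding in the error
utf8_ok = can_encode(text, "utf-8")

# Same idea as wrapping sys.stdout.buffer, but with an in-memory buffer
buffer = io.BytesIO()
utf8_stdout = io.TextIOWrapper(buffer, encoding="utf-8")
print(text, file=utf8_stdout)
utf8_stdout.flush()
written = buffer.getvalue()

print(cp932_ok, utf8_ok)
print(written)
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding='utf-8') looks like another way to get the same effect without replacing sys.stdout.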
You can use find() or select()
--Reference articles:
--Basics of CSS Selector for Web Scraping
--Differences in how to use find_all() and select() in Beautiful Soup
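As a quick comparison of the two styles, here is a small sketch that runs find(), find_all(), select(), and select_one() against the same HTML. The HTML and the variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for this example
html = """
<ul id="menu">
  <li class="item">apple</li>
  <li class="item special">banana</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find(): first matching element (or None)
first_item = soup.find("li", class_="item")

# find_all(): list of every matching element
all_items = soup.find_all("li", class_="item")

# select(): list of elements matching a CSS selector
selected = soup.select("#menu li.item")

# select_one(): first element matching a CSS selector (or None)
special = soup.select_one("li.special")

print(first_item.get_text())
print(len(all_items), len(selected))
print(special.get_text())
```

find() and find_all() take the tag name and attributes as arguments, while select() and select_one() take a single CSS selector string, so which one to use mostly seems to come down to which notation is easier to write for the page at hand.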
Note: When you look up a class attribute with the browser's developer tools, simply copying and pasting its value from the source does not always work as a selector.
Example
<div class="l-mt10 text-ellipsis__body--2lines contents-list__item-name">AIUEO</div>
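When you try to select the element by specifying this class attribute as-is
soup.select(".l-mt10 text-ellipsis__body--2lines contents-list__item-name")
it doesn't work, because a space in a CSS selector means "descendant of" rather than "another class on the same element". The class names must be joined with dots instead:
soup.select(".l-mt10.text-ellipsis__body--2lines.contents-list__item-name")
(difficult to understand ...)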
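The difference can be checked directly against the example element. This is a minimal sketch using the same HTML as above; the variable names are made up.

```python
from bs4 import BeautifulSoup

html = '<div class="l-mt10 text-ellipsis__body--2lines contents-list__item-name">AIUEO</div>'
soup = BeautifulSoup(html, "html.parser")

# Pasted as-is: the spaces are read as "descendant" combinators, so nothing matches
spaced = soup.select(".l-mt10 text-ellipsis__body--2lines contents-list__item-name")

# Classes chained with dots: matches an element that has all three classes
chained = soup.select(".l-mt10.text-ellipsis__body--2lines.contents-list__item-name")

# A single class is also enough if it identifies the element
single = soup.select(".contents-list__item-name")

print(len(spaced), len(chained), len(single))
print(chained[0].get_text())
```

If one of the class names is specific enough on its own, selecting by that single class is probably the least error-prone option.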
Combine get_text() with for ~ in ~ (so that's how you use a for statement!)
Example
print([t.get_text() for t in text])
This way, you can call get_text() on each element of the list in turn with a for statement and output the results.
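For reference, here is a self-contained sketch of the same pattern, with an ordinary for loop next to it for comparison. The HTML and the names items, texts, and texts_loop are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for this example
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

items = soup.select("li")  # list of <li> elements

# List comprehension: get_text() applied to each element in order
texts = [t.get_text() for t in items]

# The same thing written as an ordinary for loop
texts_loop = []
for t in items:
    texts_loop.append(t.get_text())

print(texts)
```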
If you want to output the acquired elements as a JSON array, use the json module and itemgetter from the operator module, and loop with a for statement to arrange them in order.
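One way this could look is sketched below. The records are made-up data standing in for scraped results, and the field names (name, price, url) are assumptions; itemgetter is used both as a sort key and to pick out the fields to keep.

```python
import json
from operator import itemgetter

# Hypothetical scraped records, made up for this example
records = [
    {"name": "banana", "price": 120, "url": "https://example.com/banana"},
    {"name": "apple", "price": 100, "url": "https://example.com/apple"},
]

# itemgetter with several keys returns a tuple of the corresponding values
pick = itemgetter("name", "price")

rows = []
for record in sorted(records, key=itemgetter("price")):  # arrange by price
    name, price = pick(record)
    rows.append({"name": name, "price": price})

# ensure_ascii=False keeps non-ASCII text (e.g. Japanese) readable in the output
json_text = json.dumps(rows, ensure_ascii=False)
print(json_text)
```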
Reference articles:
--Explanation of how to handle JSON in Python
--Let's master json dumps in Python! encoding, foramt, datetime
The distinction between functions is confusing ...
--json.load(): Reads JSON from a file and returns it converted to a dictionary
--json.loads(): Converts JSON held as a string in the program to a dictionary
--json.dump(): Converts a dictionary to JSON and writes it to a file
--json.dumps(): Converts a dictionary to a JSON string and returns it
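The four functions can be compared side by side in a short round trip. This is just a sketch: the data is made up, and an in-memory io.StringIO object is used in place of a real file so that nothing is written to disk.

```python
import io
import json

# Hypothetical data, made up for this example
data = {"title": "memo", "tags": ["python", "scraping"]}

# dumps / loads: work with strings (the "s" stands for string)
as_string = json.dumps(data)
from_string = json.loads(as_string)

# dump / load: work with file objects (StringIO stands in for an open file)
fake_file = io.StringIO()
json.dump(data, fake_file)
fake_file.seek(0)  # rewind before reading back
from_file = json.load(fake_file)

print(type(as_string).__name__)
print(from_string == data, from_file == data)
```

A simple way to remember the difference: the names ending in "s" deal with strings, and the ones without it deal with files.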
I only skimmed itemgetter, so only the reference material is listed.
--Reference: Active engineers explain how to use itemgetter () in Python [for beginners]
I only skimmed the basics of arrays (lists) as well, so only the reference material is listed.
--Reference: python array basics are now perfect! Introducing many useful methods