Learning notes from Chapters 1 to 3 of "Scraping & Machine Learning with Python". Chapters 1 to 3 cover scraping; Chapter 4 onward covers the machine learning part.
urllib is a package of modules for handling URLs. Typical methods are shown below.
- urlretrieve() ・・・ downloads data directly (the file is saved locally)
- urlopen() ・・・ reads the data into memory. To fetch via FTP, just change the https:// passed to urlopen() to ftp://
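A minimal sketch of both, using the Aozora Bunko page from the example later in these notes:

```python
import urllib.request

url = "https://www.aozora.gr.jp/index_pages/person148.html"

# urlretrieve(): download straight to a local file
urllib.request.urlretrieve(url, "person148.html")

# urlopen(): read the response into memory
with urllib.request.urlopen(url) as res:
    data = res.read()
    print(data[:100])
```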
To send a request with GET parameters, build the key/value pairs in a dictionary-type variable.
Use the urllib.parse module to URL-encode them, then append the encoded string to the URL (don't forget the "?" in between).
Import the sys module to get command-line arguments.
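A minimal sketch that combines the three points above (the endpoint and parameter names are placeholders, not a real API):

```python
import sys
import urllib.parse
import urllib.request

API = "https://example.com/search"   # placeholder endpoint

# Take the search keyword from the command line
keyword = sys.argv[1]

# Build the GET parameters as a dictionary and URL-encode them
params = urllib.parse.urlencode({"q": keyword, "lang": "ja"})

# Don't forget the "?" between the URL and the query string
url = API + "?" + params
with urllib.request.urlopen(url) as res:
    print(res.read().decode("utf-8"))
```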
BeautifulSoup is a library that parses HTML and XML. It cannot download data by itself; use urllib for downloading.
pip is Python's package management system.
PyPI is the abbreviation for the Python Package Index, the repository that pip installs packages from.
- Trace the tag hierarchy using dots (.)
- Find an element by id with the find() method
- Get all of the elements matching the given parameters with the find_all() method
- Use a CSS selector
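A minimal sketch of the four approaches, using made-up HTML:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1 id="title">Natsume Soseki</h1>
  <ol>
    <li><a href="/cards/000148/card1.html">Work 1</a></li>
    <li><a href="/cards/000148/card2.html">Work 2</a></li>
  </ol>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.body.h1.string)           # follow the tag hierarchy with dots
print(soup.find(id="title").string)  # look up by id with find()
print(soup.find_all("a"))            # every matching element with find_all()
print(soup.select("ol > li > a"))    # CSS selector with select()
```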
If you know the HTML structure and the basics of CSS, you can get almost any data. However, if the page structure changes, the scraping code has to be updated.
Example: Aozora Bunko's Natsume Soseki page https://www.aozora.gr.jp/index_pages/person148.html
The CSS selector obtained for the first li tag in the list of works is as follows.
body > ol:nth-child(8) > li:nth-child(1)
nth-child(n) ・・・ means the nth child element
Looking at the work page, the \
If you write an elegant CSS selector, you can retrieve a specific element in one shot.
**It is important to remember the selector syntax, just as it is important to remember regular expression syntax.**
The find() method can take multiple conditions (tag name, attributes, and so on) at once.
It is also possible to extract elements in combination with regular expressions.
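A small sketch of both, using made-up HTML:

```python
import re
from bs4 import BeautifulSoup

html = '<a href="local.html">local</a> <a href="https://example.com/">external</a>'
soup = BeautifulSoup(html, "html.parser")

# find() with several conditions at once: tag name plus an attribute value
print(soup.find("a", href="local.html"))

# find_all() combined with a regular expression on the href attribute
print(soup.find_all(href=re.compile(r"^https://")))
```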
If the link destination of an \<a> tag is a relative path, convert it to an absolute path using the urllib.parse.urljoin() method.
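For example (the paths are only illustrative):

```python
from urllib.parse import urljoin

base = "https://www.aozora.gr.jp/index_pages/person148.html"

# A relative path is resolved against the base URL
print(urljoin(base, "../cards/000148/card1.html"))

# An absolute URL is returned unchanged
print(urljoin(base, "https://example.com/"))
```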
To download an entire site, you need to follow and download the links recursively.
To use a regular expression, import the re module.
A package called requests is convenient for access using cookies.
Start a session with the requests.session () method.
To check the data sent at login, use the developer tool of the browser.
Check from the "Network" tab of the developer tools. To see the submitted form data, check "Form Data" on the "Header" tab.
"Selenium" is famous as a tool for remotely controlling a web browser.
If you run it headless (with no screen display) from the command line, a browser window does not pop up every time.
In addition to Chrome, Firefox, Opera, etc., iOS and Android browsers can also be operated.
Accessing a site with Selenium is the same as accessing it with a browser, so you do not need to manage sessions yourself.
You can do quite a lot with selenium. Most things that people do with a browser can be done automatically.
Furthermore, the execute_script() method lets you run arbitrary JavaScript.
- You can freely manipulate DOM elements in the HTML page → for example, remove decorative elements unrelated to the element you want to extract beforehand
- You can call JavaScript functions in the page at any time → you can get any data on the page
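A minimal headless Chrome sketch (assumes ChromeDriver is installed; the URL and script are just illustrations):

```python
from selenium import webdriver

# Run Chrome without opening a visible window
options = webdriver.ChromeOptions()
options.add_argument("--headless")
browser = webdriver.Chrome(options=options)

browser.get("https://www.aozora.gr.jp/index_pages/person148.html")

# execute_script() runs arbitrary JavaScript in the page, e.g. to read a value
# from the DOM or to strip decorative elements before scraping
title = browser.execute_script("return document.title;")
print(title)

browser.save_screenshot("page.png")  # handy for checking what headless mode rendered
browser.quit()
```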
Wikipedia prohibits crawling, so scraping it directly is not allowed. Dump data is provided on a separate site, so use that instead.
for row in result:
    print(",".join(row))
- find_all() ・・・ if you give the method a list, you can get several kinds of tags at once
- find_elements_by_css_selector() ・・・ "elements" is **plural**, so it gets multiple elements at once
- find_element_by_css_selector() ・・・ "element" is **singular**, so it gets only one element. Calling it when you intend to get multiple elements is not a syntax error, but it does not work as expected, so be careful.
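A small sketch of the plural/singular distinction (written with the older find_element(s)_by_css_selector names referred to here; Selenium 4 replaces them with find_element(By.CSS_SELECTOR, ...)). The selector is an assumption about the page:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
browser = webdriver.Chrome(options=options)
browser.get("https://www.aozora.gr.jp/index_pages/person148.html")

# plural: a list of every element matching the selector
links = browser.find_elements_by_css_selector("ol li a")
print(len(links))

# singular: only the first matching element
first = browser.find_element_by_css_selector("ol li a")
print(first.text)

browser.quit()
```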
You can also take a screenshot using the browser.save_screenshot () method. This is useful when you want to know what the actual screen looks like when operating in headless mode.
There are many parts of scraping that can only be understood by actually testing it. Think about what kind of operation is possible while analyzing the actual screen (HTML).
**It is important to understand the structure of the site. Knowledge of CSS is also required.**
A Web API is a function of a site that is published so it can be used from outside. Data is exchanged over HTTP and returned in XML or JSON format.
Be aware that the specification of a Web API may change at the convenience of the provider.
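A minimal sketch of calling an API and decoding JSON (the endpoint and parameters are placeholders, not a real API):

```python
import json
import urllib.request

# Placeholder endpoint used only for illustration; a real Web API documents
# its own URL, parameters, and response format.
url = "https://example.com/api/items?format=json"

with urllib.request.urlopen(url) as res:
    data = json.loads(res.read().decode("utf-8"))

# JSON maps naturally onto Python dicts and lists
print(data)
```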
With format(), part of a character string can be replaced later with a variable value.
(Example)
str = "hogehoge{name}fugafuga"
str.format(name="Eric Clapton")
A lambda is a function written in the form **function_name = lambda argument: expression**. (Example)
k2c = lambda k: k - 273.15  # Kelvin to Celsius (0 K = -273.15 °C)
On macOS and Linux, a daemon process called "cron" is used; on Windows, use "Task Scheduler".
A daemon is a program on a UNIX-like OS that resides in main memory and provides a specific function: a type of background process that runs independently of user operations.
**Main periodic-execution tasks**
To configure cron, run the "crontab" command (crontab -e) and edit the file it opens. On macOS, the nano editor is convenient for this.
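For example, a crontab entry that runs a scraping script every morning at 7:00 could look like this (the path is only illustrative):

```
# min hour day month weekday  command
0 7 * * * python3 /home/user/scraping/daily_job.py
```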
Scrapy is a framework for crawling and scraping.
**Basic workflow**
Create a subclass that inherits from the Spider class and place it in the spiders directory.
- parse() ・・・ describes how the text is analyzed after the data has been fetched
- css() ・・・ extracts DOM elements using a CSS selector
- extract() ・・・ gets all the matched elements as a list
- extract_first() ・・・ returns the first element of the result
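A minimal spider sketch using these methods (the CSS selectors are assumptions about the Aozora Bunko page structure):

```python
import scrapy

class SosekiSpider(scrapy.Spider):
    """Place this subclass in the project's spiders directory."""
    name = "soseki"
    start_urls = ["https://www.aozora.gr.jp/index_pages/person148.html"]

    def parse(self, response):
        # called automatically for each downloaded page
        for li in response.css("ol > li"):
            yield {
                "title": li.css("a::text").extract_first(),
                "url": li.css("a::attr(href)").extract_first(),
            }
```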
Scrapy execution command example
scrapy crawl soseki --nolog
If "--nolog" is described, the operation log is omitted. If not attached, the operation log will be output to the console.
The methods produce their results with yield; the convention is to yield items rather than return them.
The Scrapy shell runs Scrapy interactively. It is useful for checking whether a CSS selector extracts the data correctly.
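For example (the selector is an assumption about the page structure):

```
$ scrapy shell "https://www.aozora.gr.jp/index_pages/person148.html"
>>> response.css("ol > li > a::text").extract_first()
```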
A command to create a subclass of the Spider class.
scrapy genspider soseki3 www.aozora.gr.jp
The parse () method is called automatically after getting the URL specified in start_urls.
Use the response.follow () method to get the linked page.
To download a file, create a scrapy.Request(). The callback parameter specifies the method to be called once the response has been downloaded.
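A sketch combining response.follow() and scrapy.Request() with callbacks (the selectors and the .zip file type are assumptions):

```python
import scrapy

class Soseki3Spider(scrapy.Spider):
    name = "soseki3"
    start_urls = ["https://www.aozora.gr.jp/index_pages/person148.html"]

    def parse(self, response):
        for href in response.css("ol > li > a::attr(href)").extract():
            # follow a linked page; parse_card() receives its response
            yield response.follow(href, callback=self.parse_card)

    def parse_card(self, response):
        file_url = response.css("a[href$='.zip']::attr(href)").extract_first()
        if file_url:
            # request the file itself; save_file() runs once it has been downloaded
            yield scrapy.Request(response.urljoin(file_url), callback=self.save_file)

    def save_file(self, response):
        filename = response.url.split("/")[-1]
        with open(filename, "wb") as f:
            f.write(response.body)
```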
Scrapy's functionality can be extended with middleware; this mechanism can be used to incorporate Selenium.
**Format when specifying middleware**: (project directory name).(middleware file name).(middleware class name)
# Example of middleware registration (custom_settings is an attribute of the Spider subclass)
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "sakusibbs.selenium_middleware.SeleniumMiddleware": 0
    }
}
The start_requests() method defines the requests that are issued automatically when the spider starts.
The data distributed on the Web is roughly divided into two types: text data and binary data.
- Text data: examples are plain text files, XML, JSON, YAML, and CSV. Text data has to be handled with attention to character codes and encoding.
- Binary data: examples are images (png, jpeg, gif, etc.) and Excel files. The data size is smaller than the equivalent text data.
Note that the URL of the Yokohama City disaster-prevention data used in the book has changed. As of October 22, 2020 it is: https://www.city.yokohama.lg.jp/kurashi/bousai-kyukyu-bohan/bousai-saigai/bosai/data/data.files/0006_20180911.xml
Note that when parsing XML with BeautifulSoup (using an HTML parser such as html.parser), all uppercase letters in tag names are converted to lowercase.
(Example)
<LocationInformation>
<Type>Regional disaster prevention base</Type>
<Definition>It is a base equipped with a place for evacuation of the affected residents, information transmission and transmission, and stockpiling functions.</Definition>
<Name>Namamugi Elementary School</Name>
<Address>4-15-1 Namamugi, Tsurumi-ku, Yokohama-shi, Kanagawa</Address>
<Lat>35.49547584</Lat>
<Lon>139.6710972</Lon>
<Kana>Namamugi Shogakko</Kana>
<Ward>Tsurumi Ward</Ward>
<WardCode>01</WardCode>
</LocationInformation>
To get the LocationInformation elements from the above data, do the following.
Wrong: soup.find_all("LocationInformation")
Correct: soup.find_all("locationinformation")
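A minimal sketch, assuming the XML above has been downloaded to a local file:

```python
from bs4 import BeautifulSoup

with open("0006_20180911.xml", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# html.parser lowercases tag names, so search with the lowercase form
for info in soup.find_all("locationinformation"):
    print(info.find("name").string, info.find("ward").string)
```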
When dealing with Excel files, install xlrd in addition to openpyxl.
Python supports various DBs. SQLite is built into the standard library and can be used immediately by importing sqlite3.
MySQL has to be installed beforehand. On Linux, install it with apt-get; on macOS and Windows it is convenient to install MAMP, which bundles the MySQL administration tool "phpMyAdmin".
Difference in how variable placeholders are written in SQL:
- SQLite ・・・ ?
- MySQL ・・・ %s
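A minimal sqlite3 sketch showing the "?" placeholder (the table and values are made up); a MySQL driver such as PyMySQL would use "%s" in the same positions:

```python
import sqlite3

conn = sqlite3.connect("test.sqlite")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS items (name TEXT, price INTEGER)")

# SQLite uses "?" as the placeholder
cur.execute("INSERT INTO items (name, price) VALUES (?, ?)", ("Apple", 100))
conn.commit()

cur.execute("SELECT * FROM items WHERE price <= ?", (150,))
print(cur.fetchall())
conn.close()

# With a MySQL driver the same statement would be written with "%s":
#   cur.execute("INSERT INTO items (name, price) VALUES (%s, %s)", ("Apple", 100))
```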
MongoDB is one of the document-oriented databases. A relational database management system (RDBMS) requires a schema definition using CREATE TABLE, but a document database does not.
TinyDB is a library for using a document-oriented database. It is easier to use from Python than MongoDB (MongoDB requires installing the MongoDB server itself, whereas TinyDB can be used just by installing the package with pip).
If you're dealing with data of a certain size, MongoDB is a better choice.
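A minimal TinyDB sketch (the file name and records are made up):

```python
from tinydb import TinyDB, Query

db = TinyDB("test.json")   # the whole database lives in one JSON file
db.insert({"title": "Kokoro", "author": "Natsume Soseki"})
db.insert({"title": "Botchan", "author": "Natsume Soseki"})

Book = Query()
print(db.search(Book.author == "Natsume Soseki"))
```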