**This article is the second in a series of notes I put together on using Python for investment, following the previous one on SQLite.** **I will share what I researched about scraping in two parts. In this first part (①), I summarize what you need to know before scraping.**
**・What is scraping?** Extracting the required information from a website.
**・What is crawling?** Following links to collect web pages.
**・What is a crawler?** A crawler is a program that patrols the Internet and collects and saves data such as web pages, images, and videos. Web search engines such as Google and Bing, for example, use crawlers to collect web pages from all over the world in advance, which is why they can return search results so quickly.
**Points to keep in mind before crawling:**
・Web pages are basically copyrighted works.
・Some websites explicitly prohibit crawling in their Terms of Service or help pages.
・Do not crawl pages that are disallowed by robots.txt or robots meta tags.
・Even where crawling is permitted, be careful not to put a load on the web server.
**・What is robots.txt?** robots.txt is a file placed on a site to inform crawlers of access restrictions. You can check a site's robots.txt by appending "/robots.txt" to the URL of its root page (the top page of the site).
(Yahoo) root page: https://www.yahoo.co.jp/
(Yahoo) robots.txt: https://www.yahoo.co.jp/robots.txt
**・What are (robots) meta tags?** They are used for a purpose similar to robots.txt and are written in the header part of the HTML file, as a kind of description of the page.
**・What is parsing?** Analyzing data written according to a certain format or grammar and checking whether its syntax matches that grammar.
**・What is a parser?** A program that analyzes structured character data and converts it into data structures that a program can handle.
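As a minimal illustration of what a parser does, here is a sketch using Python's built-in html.parser module (this is only for illustration; later in this article I use BeautifulSoup instead):

```python
from html.parser import HTMLParser

# A tiny parser that reports the tags and text it finds in an HTML string
class TagPrinter(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("start tag:", tag, attrs)

    def handle_data(self, data):
        if data.strip():
            print("text:", data.strip())

parser = TagPrinter()
parser.feed('<meta name="robots" content="noindex"><p>Hello</p>')
```

The flat HTML string is turned into a sequence of structured events (tags with their attributes, and text), which is exactly the kind of conversion a parser performs.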
Python has a standard library called urllib.robotparser for this, but I felt that reppy was easier to use, so I will use reppy here.
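For comparison, here is a minimal sketch of the same kind of check using the standard library's urllib.robotparser (the URLs are just examples):

```python
import urllib.robotparser

# Read robots.txt with the standard library's parser
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.msn.com/robots.txt')
rp.read()

# can_fetch(useragent, url): may this user-agent fetch this URL?
print(rp.can_fetch('*', 'https://www.msn.com/ja-jp/news'))
# crawl_delay(useragent): the Crawl-delay for this user-agent, or None
print(rp.crawl_delay('*'))
```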
**Note, however, that a robots.txt that is not written according to the rules may not be read correctly. In that case you need to check it by appending /robots.txt directly to the URL, or by some other method.**
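For example, the raw robots.txt can always be inspected directly by fetching it over HTTP (a sketch using the requests library; the URL is just an example):

```python
import requests

# Fetch robots.txt directly by appending /robots.txt to the site root
response = requests.get('https://www.yahoo.co.jp/robots.txt')
print(response.text)
```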
**How to use reppy's parser library**
(Install reppy with `pip install reppy`.)
- Read robots.txt with `fetch`, and specify the user-agent with `agent` to indicate which crawler to check.
- Use `allowed` to check whether the URL you want to crawl is accessible.
- Check `delay` for the Crawl-delay (the crawl interval specified by the site). *If this is specified, it must be followed.
Below, the homepage of MSN is taken as an example.
```python
from reppy.robots import Robots

# Read MSN's robots.txt with fetch
robots = Robots.fetch('https://www.msn.com/robots.txt')
# Specify the crawler to check with a wildcard ('*'), i.e. target all crawlers
agent = robots.agent('*')
# Use allowed to check whether each specified URL is accessible under robots.txt
print(agent.allowed('https://www.msn.com/ja-jp/news'))
print(agent.allowed('https://www.msn.com/ja-jp/health/search/filter'))
```
Execution result

```
True
False
```
This confirms that the first URL is allowed and the second is not.
Next, here is a case where the site specifies a crawl interval, taking the CrowdWorks homepage as an example.
```python
from reppy.robots import Robots

# Read CrowdWorks' robots.txt with fetch
robots = Robots.fetch('https://crowdworks.jp/robots.txt')
# Specify bingbot (Bing's crawler, i.e. Microsoft's search crawler) and check delay
agent = robots.agent("bingbot")
print(agent.delay)
# Check the wildcard for comparison
agent = robots.agent("*")
print(agent.delay)
```
Execution result

```
10.0
None
```
You can see that a crawl interval of 10 seconds is specified for the Bing crawler.
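As a rough sketch of how such a delay could be respected while crawling (the URL list and the 1-second fallback are my own assumptions, not something the site specifies), it might look like this:

```python
import time

import requests
from reppy.robots import Robots

robots = Robots.fetch('https://crowdworks.jp/robots.txt')
agent = robots.agent('*')

# Use the site's Crawl-delay if one is given; otherwise fall back to a conservative 1 second
interval = agent.delay if agent.delay is not None else 1.0

# Example URLs only; check allowed() before fetching each one
urls = ['https://crowdworks.jp/']
for url in urls:
    if agent.allowed(url):
        response = requests.get(url)
        print(url, response.status_code)
    # Wait between requests so as not to burden the server
    time.sleep(interval)
```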
Also check access restrictions declared with HTML tags and HTTP headers. If a meta tag or an `<a>` tag on a page contains a directive such as noindex or nofollow, then indexing the page or following that link is prohibited.
Use BeautifulSoup4 to check the meta tags of the URL you want to inspect (install it with `pip install beautifulsoup4`; the code below also uses the requests library).

```python
import requests
from bs4 import BeautifulSoup

# Get the web page specified in requests.get()
url = requests.get("https://www.yahoo.co.jp/")
# Create a BeautifulSoup object (parse the HTML obtained as text with html.parser)
soup = BeautifulSoup(url.text, "html.parser")
# soup = BeautifulSoup(url.content, "html.parser")
# Use find to get the first matching <meta> tag, passing name="robots" via attrs
robots = soup.find("meta", attrs={'name': 'robots'})
print(robots)
```
Execution result

```
<meta content="noodp" name="robots"/>
```
Apparently Yahoo has a robots meta tag with the value noodp (NO Open Directory Project).
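The HTTP header side mentioned above can be checked in a similar way. Below is a minimal sketch that looks for an X-Robots-Tag response header and counts rel="nofollow" links (whether a particular site actually sends this header is not something I verified; this is just one way to check):

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.yahoo.co.jp/")

# Some sites send crawler directives in the X-Robots-Tag HTTP response header
print(response.headers.get("X-Robots-Tag"))

# Links marked rel="nofollow" ask crawlers not to follow them
soup = BeautifulSoup(response.text, "html.parser")
nofollow_links = soup.find_all("a", attrs={"rel": "nofollow"})
print(len(nofollow_links))
```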
**For reference, I used the method introduced above to look into sites that often come up when people talk about scraping for investment. The conclusions on whether scraping each site is OK or NG are not stated in this article. My impression while investigating was that finance-related businesses generally seem to prohibit it.** In some cases a separate API is provided, so it may be a good idea to use that instead.
**Example ①: Stock investment memo** https://kabuoji3.com/
・robots.txt: `Allow: /` (everything under the root directory is allowed)
・Meta tags: not listed
・Homepage rules: not stated
(The data can also be obtained by other means, but I will omit that this time.)

**Example ②: Stock search** https://kabutan.jp/
・robots.txt: `Disallow: /94446337/` (the /94446337/ directory and below are not allowed)
・Meta tags: not listed
・Homepage rules: applying an unreasonable load is prohibited (see below)
Article 4 (**Prohibitions**)
**Example ③: Shikiho Online** https://shikiho.jp/
・robots.txt: not listed
・Meta tags: not listed
・Homepage rules: prohibited acts are defined (see below)
Article 13 (Other **prohibited acts** by users)
**Example ④: Yahoo! Finance** https://finance.yahoo.co.jp/
・robots.txt: not listed
・Meta tags: not listed
・Homepage rules: destroying or interfering with network functions is prohibited (see below)
●Terms, 7. Compliance items when using the service
When using our services, the following acts (including acts that induce them and preparatory acts) are **prohibited**:
**4. Acts that destroy or interfere with the functionality of our servers or networks**
●Yahoo! Finance Help
**Automatic acquisition (scraping) of information posted on Yahoo! Finance is prohibited.**
(References: Yahoo! Finance Terms of Service, Yahoo! Finance Help)
There seems to be no firmly established rule about crawl intervals. The point is simply not to put a load on the other party's service, and nowadays a guideline of about one request per second seems to be widely accepted when no Crawl-delay is specified. (It apparently spread after the Okazaki Municipal Central Library incident.)
Since this also serves as a personal memorandum, it may have been a difficult article for beginners, but the topic is important, so I summarized it anyway. In short, there are many gray areas here, and when something is not explicitly stated as NG there is often no clear rule. In any case, causing trouble for the operation of the other party's server is NG, so it is important to hold back whenever you are not sure.
It also seems necessary to keep in mind that, as a separate issue, problems can arise depending on how the data collected by a crawler is used.
**The next article will cover actual scraping.**