This is the Day 7 article of the LAPRAS Output Relay! Hello! This is @Chanmoro, a crawler engineer at LAPRAS!
The other day I wrote the article How to make a crawler - Basic. In this article, entitled "How to make a crawler - Advanced", I would like to briefly introduce the kinds of problems you run into when you start developing a crawler in earnest, and what kind of design keeps a crawler maintainable so that it can deal with those problems easily.
The method introduced in How to make a crawler - Basic is enough to implement a minimal crawler, but once you start operating it continuously, repeatedly crawling and updating data, you will usually run into the following problems:
- The HTML structure of the crawl destination changes
- Failures or temporary errors on the crawl destination
- IDs change
For crawler developers these are all painfully familiar problems, but at the very least they have to be handled if you want to keep the crawler running.
The service you crawl keeps changing, and one day its HTML structure may suddenly change so that you can no longer get the data you want. Depending on the design, it is quite possible that a garbage value will be extracted and written over the database, destroying the data accumulated so far.
Here's how I've dealt with this challenge so far:
- Validate the values obtained by crawling
- Periodically run tests that observe the crawl destination service at a fixed point
The first approach, and probably the first one you would think of, is to validate the data retrieved from the crawl destination. By implementing validation rules for data whose format and type can be constrained, you can prevent strange values from entering the database. The weakness of this method is that you cannot define validation rules for data that has no fixed format or that is not a required field, so it can only be applied to some of the items.
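As a concrete image, a minimal sketch of this kind of validation might look like the following. The field names and the date format here are assumptions for illustration, not the actual schema the crawler uses.

```python
import dataclasses
import re


@dataclasses.dataclass(frozen=True)
class CrawledArticle:
    # Hypothetical fields for illustration; adjust to the data you actually crawl
    title: str
    url: str
    publish_date: str  # assumed format: "2020.02.21"

    def __post_init__(self):
        # Reject obviously broken values instead of letting them overwrite the database
        if not self.title:
            raise ValueError("title must not be empty")
        if not self.url.startswith("https://"):
            raise ValueError(f"unexpected url format: {self.url}")
        if not re.fullmatch(r"\d{4}\.\d{2}\.\d{2}", self.publish_date):
            raise ValueError(f"unexpected publish_date format: {self.publish_date}")
```

If the HTML structure changes and the parser starts extracting the wrong element, the value will usually fail one of these checks, so the error surfaces before anything is written to the database.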
The second approach is fixed-point observation: detect changes in the HTML structure by periodically running a test that actually accesses the crawl destination service and checks whether the data obtained matches the expected values. You can implement it by writing a test against a specific account or page and running it on a regular schedule.
Scrapy has a feature called contracts, which lets you actually access a specified URL and write tests against the response you get; that is where I got the idea.
However, this method only works well if you can create an account for fixed-point observation or prepare the data yourself; otherwise the test will fail every time the target account or page is updated (depending on what exactly the test checks). You also cannot write strict tests for data that changes constantly, such as exchange rates.
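As a rough sketch, a fixed-point observation test might look like the following pytest-style test. The URL and expected values are hypothetical placeholders; you would point this at an account or page you control and run it on a schedule (for example from CI), separately from the normal unit tests.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical fixed-point observation target: a page we control,
# whose content we therefore do not expect to change.
FIXED_POINT_URL = "https://note.lapras.com/example-fixed-article/"
EXPECTED_TITLE = "Example article for fixed point observation"


def test_article_detail_page_structure():
    # Intentionally access the live page; this is an observation test, not an offline unit test
    response = requests.get(FIXED_POINT_URL, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    title_element = soup.select_one("h1")

    # If the HTML structure changes, these assertions fail and we notice the change
    assert title_element is not None, "h1 not found - the HTML structure may have changed"
    assert title_element.get_text(strip=True) == EXPECTED_TITLE
```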
I've shown two ways to address the problem of the crawled HTML structure changing, but of course these are not the only ones.
When a failure occurs on the crawl destination service, it may temporarily return error responses. This becomes a problem if the crawler branches its processing based on the response status code.
For example, suppose the crawler is implemented so that a 404 response means the corresponding data has been deleted, and it then deletes that data on its own side. If the crawl destination temporarily returns 404 because of a bug or some other error, the data has not actually been deleted, but the crawler mistakenly concludes that it has.
An effective way to deal with this is to wait for a while and retry, in order to distinguish whether the response was only temporary.
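For example, a simple retry wrapper could look like the sketch below. The retry count and wait time are arbitrary values for illustration; only after the error persists across all retries should the caller trust the status code (for example, treating a 404 as an actual deletion).

```python
import time

import requests


def get_with_retry(url: str, max_retries: int = 3, wait_seconds: float = 30.0) -> requests.Response:
    """Fetch a URL, retrying a few times so that a temporary error is not mistaken for a permanent one."""
    response = requests.get(url, timeout=30)
    for _ in range(max_retries):
        if response.status_code < 400:
            return response
        # The error (e.g. 404 or 5xx) may be temporary, caused by a failure on the
        # crawl destination side, so wait for a while and try again.
        time.sleep(wait_seconds)
        response = requests.get(url, timeout=30)
    return response
```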
When you crawl a service whose URLs contain the ID of the resource and whose specification allows that ID to be changed later, you may want to treat the data under the new ID as the same as the data before the change. Examples are services that let users change their user ID, or change the URL of data they have already posted.
Some services return a 301 redirect from the old URL, in which case you can map the old ID to the new one. This case is relatively easy to handle: follow the 301 redirect, take the ID contained in the redirected URL, and update your data. Note that the `id` in the crawl destination's data is mutable, so it should not be used as the ID within the crawler's own system.
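A sketch of this kind of handling might look like the following; it assumes a hypothetical URL layout where the ID is the last path segment (e.g. `https://example.com/users/<id>`).

```python
import requests


def resolve_current_id(old_url: str) -> str:
    """Follow redirects from a possibly outdated URL and return the ID found in the final URL."""
    response = requests.get(old_url, timeout=30)
    response.raise_for_status()

    # response.history is non-empty when we were redirected (e.g. a 301 from the old URL),
    # so we can detect that the ID has changed and update our stored data.
    if response.history:
        print(f"Redirected: {old_url} -> {response.url}")

    # Assumed URL layout: the ID is the last path segment
    return response.url.rstrip("/").rsplit("/", 1)[-1]
```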
In other cases the old URL simply returns 404 and you cannot determine which new URL it corresponds to, so you may have no choice but to delete the data for the old URL and wait for the data under the new URL to be picked up as new data.
I imagine that most crawlers will face the three problems introduced so far.
The [code introduced previously](https://qiita.com/Chanmoro/items/c972f0e9d7595eb619fe#python-%E3%81%A7%E3%83%99%E3%82%BF%E3%81%AB%E5%AE%9F%E8%A3%85%E3%81%99%E3%82%8B) was a fairly flat, monolithic implementation, so it is hard to tell where the logic that addresses the issues introduced here should go.
So, for example, let's divide the crawler processing into several layers as follows.
import json
import time
import dataclasses
from typing import List, Optional

import requests
from bs4 import BeautifulSoup


@dataclasses.dataclass(frozen=True)
class ArticleListPageParser:
    @dataclasses.dataclass(frozen=True)
    class ArticleListData:
        """
        A class that represents the data retrieved from the article list page
        """
        article_url_list: List[str]
        next_page_link: Optional[str]

    def parse(self, html: str) -> ArticleListData:
        soup = BeautifulSoup(html, 'html.parser')
        next_page_link = soup.select_one("nav.navigation.pagination a.next.page-numbers")

        return self.ArticleListData(
            article_url_list=[a["href"] for a in soup.select("#main div.post-item h2 > a")],
            next_page_link=next_page_link["href"] if next_page_link else None,
        )


@dataclasses.dataclass(frozen=True)
class ArticleDetailPageParser:
    @dataclasses.dataclass(frozen=True)
    class ArticleDetailData:
        """
        A class that represents the data retrieved from the article detail page
        """
        title: str
        publish_date: str
        category: str
        content: str

    def parse(self, html: str) -> ArticleDetailData:
        soup = BeautifulSoup(html, 'html.parser')
        return self.ArticleDetailData(
            title=soup.select_one("h1").get_text(),
            publish_date=soup.select_one("article header div.entry-meta").find(text=True, recursive=False).replace("|", ""),
            category=soup.select_one("article header div.entry-meta a").get_text(),
            content=soup.select_one("article div.entry-content").get_text(strip=True),
        )


@dataclasses.dataclass(frozen=True)
class LaprasNoteCrawler:
    INDEX_PAGE_URL = "https://note.lapras.com/"

    article_list_page_parser: ArticleListPageParser
    article_detail_page_parser: ArticleDetailPageParser

    def crawl_lapras_note_articles(self) -> List[ArticleDetailPageParser.ArticleDetailData]:
        """
        Crawl LAPRAS NOTE to get all article data
        """
        return [self.crawl_article_detail_page(u) for u in self.crawl_article_list_page(self.INDEX_PAGE_URL)]

    def crawl_article_list_page(self, start_url: str) -> List[str]:
        """
        Crawl the article list pages to get the URLs of all article detail pages
        """
        print(f"Accessing to {start_url}...")
        # Access https://note.lapras.com/
        response = requests.get(start_url)
        response.raise_for_status()
        time.sleep(10)

        # Get the URLs of the article detail pages from the response HTML
        page_data = self.article_list_page_parser.parse(response.text)
        article_url_list = page_data.article_url_list

        # Follow the link to the next page if there is one
        while page_data.next_page_link:
            print(f'Accessing to {page_data.next_page_link}...')
            response = requests.get(page_data.next_page_link)
            time.sleep(10)
            page_data = self.article_list_page_parser.parse(response.text)
            article_url_list += page_data.article_url_list

        return article_url_list

    def crawl_article_detail_page(self, url: str) -> ArticleDetailPageParser.ArticleDetailData:
        """
        Crawl the article detail page to get the article data
        """
        # Access the article detail page
        print(f"Accessing to {url}...")
        response = requests.get(url)
        response.raise_for_status()
        time.sleep(10)

        # Get the article data from the response HTML
        return self.article_detail_page_parser.parse(response.text)


def collect_lapras_note_articles_usecase(crawler: LaprasNoteCrawler):
    """
    Get all the data of LAPRAS NOTE articles and save it to a file
    """
    print("Start crawl LAPRAS NOTE.")
    article_list = crawler.crawl_lapras_note_articles()

    output_json_path = "./articles.json"
    with open(output_json_path, mode="w") as f:
        print(f"Start output to file. path: {output_json_path}")
        article_data = [dataclasses.asdict(d) for d in article_list]
        json.dump(article_data, f)
        print("Done output.")

    print("Done crawl LAPRAS NOTE.")


if __name__ == '__main__':
    collect_lapras_note_articles_usecase(LaprasNoteCrawler(
        article_list_page_parser=ArticleListPageParser(),
        article_detail_page_parser=ArticleDetailPageParser(),
    ))
The code is here. https://github.com/Chanmoro/lapras-note-crawler/blob/master/advanced/crawler.py
By separating the crawler into three layers, parser, crawler, and usecase, it becomes clearer where to make the following changes to address the problems introduced earlier:
- The HTML structure of the crawl destination changes
  - Validate in the parser layer, or write fixed-point observation tests
- Crawl destination failure or temporary error
  - Raise exceptions and return values that convey the context from the crawler layer
  - Retry or branch the flow in the usecase layer (see the sketch after this list)
- IDs change
  - Include information in the crawler layer's return value showing that the ID has changed
  - Handle the matching and updating of already captured data in the usecase layer
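As an illustration of the second point, the sketch below shows one way the crawler layer could raise a contextual exception and the usecase layer could retry before deciding to delete data. `crawler` and `repository` are hypothetical collaborators used only to show the shape of the flow.

```python
import time


class ArticleNotFoundError(Exception):
    """Raised by the crawler layer when the article detail page returns 404."""


def sync_article_usecase(crawler, repository, url: str, max_retries: int = 3) -> None:
    """Usecase-layer sketch: retry on a contextual exception before deleting data."""
    for _ in range(max_retries):
        try:
            article = crawler.crawl_article_detail_page(url)
            repository.save(article)
            return
        except ArticleNotFoundError:
            # The 404 may be temporary, so wait for a while and try again
            time.sleep(60)

    # The 404 persisted across all retries, so treat the article as actually deleted
    repository.delete_by_url(url)
```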
In this article, entitled "How to make a crawler - Advanced", I have written about the problems that often occur when a crawler is operated continuously, and about how to design a crawler so that it can deal with them easily.
This is just an example kept simple for clarity, so in practice you will need to refine the design further to match the characteristics of the crawl destination service and the requirements of the service that uses the crawled data.
That said, this kind of design is not specific to crawlers; most of it is ordinary data modeling for any system that exchanges data with external services, APIs, or libraries. The more crawlers I develop, the more I feel that there is actually not that much design that is unique to crawlers.
Following the previous article, How to make a crawler - Basic, I have introduced what crawler development actually looks like, based on my own experience developing crawlers.
I hope it will be useful for those who are having trouble developing crawlers now and those who want to develop crawlers in the future!
Let's enjoy a good crawler development life!