I work with crawlers fairly often, and lately I have been thinking that if I were to build a new one I would base it on Django + Celery, so I put together a sample and wrote down my thoughts here.
The word "crawler" means slightly different things to different people, so let me define the scope of this article:

- Collect links to web pages that match certain conditions
- Scrape the content of each linked page
- Anything that happens after scraping is out of scope this time

I will proceed on that premise.
y-matsuwitter/django_sample_crawler
This is the repository for the sample. Using this structure, I implemented something with the following properties:
- Can scale out to multiple servers
- Crawls GitHub's trending repositories
- Crawl rules can be added later
- Does not hit GitHub faster than a fixed rate
- Easy to retry on failure
If a crawler runs for a long time, a single part of its pipeline can become bloated and slow; for example, the content-extraction step alone may end up consuming most of the CPU. When that happens you need to be able to split that step out on its own, and I think it is worth building on a message-queue system from the start to keep that kind of distributed operation simple. Here, Celery handles this layer.
Internally, fetching GitHub trending is broken into a sequence of steps that are executed in order. Each step is a separate Celery task, so by adjusting the worker configuration you can assign a different server cluster to each step. The split is fairly coarse for a small crawler like this one, but defining each step as its own task also makes it much easier to retry when a particular task fails. For example, if the trending page downloads fine but parsing stops because a rule is inadequate, you can recover by fixing only the affected task and re-running just that task; and if tasks pile up at one layer, you can simply add resources to that layer.
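As a rough illustration of that split, here is a minimal sketch with each step as its own task and a retry on the download step. The task names, the chaining, and the URL are my own choices for illustration, not necessarily how the sample repository organizes things.

import requests
from celery import shared_task, chain


@shared_task(bind=True, max_retries=3, default_retry_delay=60)
def fetch_trending_page(self, crawler_id):
    """Download the trending page HTML; retry on network errors."""
    try:
        response = requests.get("https://github.com/trending", timeout=30)
        response.raise_for_status()
        return response.text
    except requests.RequestException as exc:
        raise self.retry(exc=exc)


@shared_task
def parse_trending_page(page_html, crawler_id):
    """Extract repository links from the downloaded HTML."""
    # ... parse with lxml or BeautifulSoup and return a list of URLs ...
    return []


def run_crawler(crawler_id):
    # Each step is an independent task, so worker pools, retries and
    # rate limits can be managed per step.
    return chain(
        fetch_trending_page.s(crawler_id),
        parse_trending_page.s(crawler_id),
    ).apply_async()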
Celery has a periodic-task mechanism: you can mark a task to run periodically with periodic_task, or register it in the beat schedule. Here, the most fundamental part, the task that runs the crawler for each rule, is registered in the schedule.
from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    'run_all_crawler_schedule': {
        'task': 'crawler.tasks.run_all_crawler',
        'schedule': crontab(minute='*/2'),
    },
}
Incidentally, the every-two-minutes interval is just a throwaway value I used while debugging; if you actually run this, leave a more generous interval.
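The periodic_task decorator mentioned above can express the same thing directly on the task. A minimal sketch, assuming an older Celery release (3.x) where celery.task.periodic_task is still available (it was removed in later versions):

from celery.schedules import crontab
from celery.task import periodic_task


@periodic_task(run_every=crontab(minute='*/2'))
def run_all_crawler():
    # Kick off a crawl for every registered rule.
    pass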
rate_limit: when you crawl a site in depth, the HTML-fetching step can put a real load on the target site. Celery has a mechanism called rate_limit that lets you cap how often each task may run.
@celery_app.task(rate_limit="60/m")
def fetch_trending(crawler_id):
    pass
With the setting above, the task is restricted so that it runs at most 60 times per minute (note that Celery applies this limit per worker instance, not globally across the cluster).
You could build an ordinary crawler with Celery alone, but over long-term operation the number of rules and target sites keeps growing. Once you want to store those rules in a database and check on the crawler's state through a web management screen, combining Celery with Django becomes very convenient. In this sample the rules are simple, little more than the settings for fetching trending, but they can be managed through the Django admin.
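As a sketch of what managing rules through the admin might look like, assuming a hypothetical CrawlerRule model (the field names here are mine, not necessarily the ones used in the sample project):

# models.py
from django.db import models


class CrawlerRule(models.Model):
    name = models.CharField(max_length=100)
    start_url = models.URLField()
    link_xpath = models.CharField(max_length=255)  # selector used when parsing
    enabled = models.BooleanField(default=True)

    def __str__(self):
        return self.name


# admin.py
from django.contrib import admin

from .models import CrawlerRule


@admin.register(CrawlerRule)
class CrawlerRuleAdmin(admin.ModelAdmin):
    list_display = ('name', 'start_url', 'enabled')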
When you add rules later, the trickiest part is probably the scraping rules themselves; storing the XPath or CSS selector as part of the rule goes a long way here. Scrapy's approach is also worth referring to.
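For example, a stored XPath could be applied with lxml roughly like this (again a sketch, using the hypothetical link_xpath field from above):

from lxml import html


def extract_links(rule, page_html):
    """Apply the rule's stored XPath to a downloaded page and return link URLs."""
    tree = html.fromstring(page_html)
    # e.g. rule.link_xpath = '//article//h2/a/@href'
    return [str(href) for href in tree.xpath(rule.link_xpath)]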
There does not seem to be much knowledge published about operating relatively large-scale crawlers, so I would like to hold a study session to share what people know in this area.