A story about how good Scrapy is

Introduction

Hello. This article is Day 16 of the MYJLab Advent Calendar. It's my second day in a row, so I'm a little out of breath, but I'd like to keep doing my best introducing my recommended tools. In the previous article, I introduced JupyterLab, the analysis environment I use. It's a really good tool, so please try it. Today, I would like to introduce a tool I use for data collection before starting an analysis.

About scraping and crawling

How do you collect the data when you need it for machine learning and the like? If you are developing on your own, you can't afford to buy data, but you can't create it from scratch either. In my case, when I'm short on data, I try scraping first. [Scraping](https://ja.wikipedia.org/wiki/%E3%82%A6%E3%82%A7%E3%83%96%E3%82%B9%E3%82%AF%E3%83%AC%E3%82%A4%E3%83%94%E3%83%B3%E3%82%B0) is a technique for extracting the information you need from a web page. A similar technique is crawling, but where the line between scraping and crawling falls varies from person to person. In this article, they are classified as follows:

  * Crawling: following links to move between web pages and download them
  * Scraping: extracting the necessary information from the downloaded pages

Usually, the information you want to extract is not contained in a single web page, so crawling and scraping are generally done together. For example, suppose you want to collect Yahoo News data to generate fake news with a Markov chain. Yahoo News consists of two main kinds of pages.

  1. A page listing news articles
  2. A page displaying the details of a news article

If you want to collect all the news, do the following:

  1. Get a list page of news articles
  2. Extract the links to the detail pages from the list page
  3. Access each link obtained in step 2
  4. Extract the information you need
  5. Repeat steps 2-4 for the next list page

By repeating this process, you can collect all the news you want. However, when a small-fry engineer like me turns these tasks into code, the result is not readable. And readability is not the only thing you need to be aware of when crawling: you also have to be considerate of the target site, for example by respecting robots.txt, putting intervals between requests, and handling errors and retries.

I'm not a skilled enough engineer to write clean code while keeping all of these things in mind. However, these problems can be easily solved with **Scrapy**.
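In fact, much of the crawl etiquette can be handled declaratively in Scrapy. A minimal sketch of the relevant lines in a project's `settings.py` (the values here are illustrative, not recommendations):

```python
# settings.py -- crawl-politeness settings (illustrative values)

ROBOTSTXT_OBEY = True                # respect the site's robots.txt
DOWNLOAD_DELAY = 1.0                 # wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # limit parallelism per domain
AUTOTHROTTLE_ENABLED = True          # back off automatically under load
RETRY_ENABLED = True                 # retry failed requests for you
```

With these in place, the Spider code itself stays focused on extraction.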

About Scrapy

Scrapy is a Python framework for web crawling and web scraping. It's a great tool that can do everything from monitoring to automated testing. It takes care of the tedious parts of crawling and scraping mentioned above. In addition, it automatically generates code templates for scraping, so you can write uniform (?) code.

How Scrapy works

To be honest, the fastest way to learn how to use Scrapy is to go through the official Tutorial, so I will skip the basics here. Instead, I would like to do my best to explain the mechanism of Scrapy, which I personally found hard to understand. Scrapy works as follows.

(Figure: the Scrapy architecture diagram, showing the numbered data flow between the Engine, Scheduler, Downloader, Spiders, and Item Pipeline)

This figure can be a little hard to follow, so I will do my best to explain it. Scrapy is made up of six main parts.

Engine

The Engine is responsible for controlling the flow of data through Scrapy. Scrapy is built on Twisted, an event-driven network programming framework, and the Engine fires events when particular actions occur.

Scheduler

The Scheduler stores the requests it receives from the Engine and controls when they are dispatched. In the figure above, ②, ③, and ⑧ are the Scheduler's job.
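To make the Scheduler's role concrete, here is a much-simplified toy sketch of what it does conceptually (the real Scheduler uses request fingerprints, priority queues, and optional disk persistence, none of which appear here):

```python
from collections import deque


class ToyScheduler:
    """Toy sketch of Scrapy's Scheduler: queue requests from the Engine,
    drop duplicates, and hand them back one at a time on demand."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()  # the real Scheduler uses request fingerprints

    def enqueue_request(self, url):
        if url in self._seen:       # duplicate filtering
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def next_request(self):
        # The Engine calls this when the Downloader has capacity
        return self._queue.popleft() if self._queue else None
```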

Downloader

The Downloader's job is to fetch web pages and pass them to the Spider via the Engine. All access to websites goes through the Downloader.

Spider

The Spider is the part that developers mainly write themselves; it extracts and saves elements. Scrapy manages data in units called Items: if the Spider returns an Item object, the data-saving flow starts, and if it returns a Request object, that request is scheduled to be crawled.

Item pipeline

The Item Pipeline is responsible for processing the Items extracted by the Spider. Cleansing Items and storing them in a database such as MySQL is the Item Pipeline's role.
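A pipeline is just a class with a `process_item` method. Here is a minimal cleansing sketch; the actual database write that would follow (MySQL, etc.) is omitted, and the class name is made up:

```python
class CleansingPipeline:
    """Sketch of an Item Pipeline that cleanses items before storage."""

    def process_item(self, item, spider):
        # Strip surrounding whitespace from every string field
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        # Returning the item passes it on to the next pipeline stage
        return item
```

The pipeline is then activated by listing it under the `ITEM_PIPELINES` setting with a priority number.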

Middleware

Middleware sits in between the interactions of the other parts (the dark blue parts in the figure above). There are two kinds of Middleware, Downloader Middleware and Spider Middleware, each with a different role.
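As one example, a Downloader Middleware can rewrite every outgoing request before it reaches the Downloader. A minimal sketch (the class and header names are made up; Scrapy discovers the hook methods by name, so no base class is needed):

```python
class CustomHeaderMiddleware:
    """Sketch of a Downloader Middleware that tags outgoing requests."""

    def process_request(self, request, spider):
        # Add a header to every request on its way to the Downloader
        request.headers["X-Crawled-By"] = "my-crawler"
        return None  # None tells Scrapy to keep processing the request
```

Like pipelines, it is enabled with a priority number, under the `DOWNLOADER_MIDDLEWARES` setting.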

In closing

It's a bit of a rambling article, but I hope it gave you a feel for Scrapy. Scrapy has many other useful features, so if you're interested, please check them out.
