How to make a crawler - Basic

This is the first-day article of the LAPRAS output relay!

Hello! This is @Chanmoro, a crawler engineer at LAPRAS! Under the title "LAPRAS output relay", LAPRAS members will be publishing an article every day until the end of March! Study sessions and conferences have been canceled because of the recent coronavirus outbreak, but I hope this LAPRAS output relay helps keep up engineers' motivation for input and output.

Content of this article

I usually work as a crawler developer, and in this article I would like to write about the development flow I follow when building a new crawler.

As a sample, I would like to introduce an example crawler implementation that collects the articles published on LAPRAS NOTE, the company media operated by LAPRAS, and outputs them to a JSON file.

Crawler development procedure

When implementing a crawler, I roughly follow the steps below to investigate, design, and implement.

  1. Investigate the link structure of the site and the navigation between pages
  2. Investigate the HTML structure of the pages to be crawled
  3. Implement the crawler

I will explain each in detail.

1. Investigate the link structure of the site and the navigation between pages

If you take a quick look at LAPRAS NOTE, a list of articles is displayed on the top page, and the links in that list lead to the individual article pages.

You can see that it is roughly composed of two types of pages.

- Article list page
- Article detail page

Let's take a closer look at how these pages are structured.

Check the article list page

On the article list page, you can see that each article's title, category, publication date, a digest of the body, and a link to its detail page are displayed.

You can also see that a paging link to the next page is displayed at the bottom of the page.

If you go to the second page, which is currently the last page, you can see that no link to a next page is displayed there.

So the crawler should move to the next page as long as a next-page link exists, and treat the current page as the last one when there is no such link.

Examine the article detail page

Next, let's take a look at the contents of the article detail page. From this page we can get the article's title, publication date, category, and body.


Since the goal is to get all the articles, it seems we do not need to think about navigating from an article detail page to any other page.

Summarize the structure of the data to be extracted

From the site survey above, it looks like the following data can be extracted.

- Article
  - URL
  - Title
  - Publication date
  - Category
  - Body text

I also found that, in order to extract the above data for every LAPRAS NOTE article, the site should be followed in the following flow.

  1. Access the article list page and get the URLs of the article detail pages
  2. If there is a link to the next page, move to it and get the article detail URLs in the same way as in 1
  3. Access each article detail page and get the article information

2. Investigate the HTML structure of the pages to be crawled

Next, let's look at the HTML structure of the pages to be crawled and figure out how to extract the target data. Here we will use the web browser's developer tools.

List of published articles

I want to extract the following from the article list page.

- Link URL of each article detail page
- Link URL to the next page

Get the link URL of the article detail page

First, find the element that corresponds to one article in order to find the link to the article detail page.

Looking at the HTML structure of the page, I found that the `div` element with the `post-item` class corresponds to the scope of a single article. Also, looking inside the corresponding `div.post-item`, you can see that the URL of the article detail page is set on the `a` tag directly under the `h2` tag.


The CSS path to the `a` tag we want to get is `div.post-item h2 > a`.

The expectation is that on the first page of the article list, **10 elements should match this CSS path**, but let's make sure we are not also matching unrelated URLs. For example, you can run the following JavaScript from your browser's console to see how many elements the CSS selector matches.

document.querySelectorAll("#main div.post-item h2 > a").length

If you execute the above in the browser console with the first page of the article list displayed, you get the result 10, which confirms that this CSS path seems to be fine.
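
Incidentally, the same kind of check can also be done from Python using requests and Beautiful Soup instead of the browser console. The snippet below is a minimal sketch (assuming both packages are installed) that simply counts how many elements the selector matches:

```python
# Count the elements matched by the article link selector, as a cross-check of
# the browser console result. Assumes requests and beautifulsoup4 are installed.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://note.lapras.com/")
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
# Expected to print 10 on the first page of the article list
print(len(soup.select("#main div.post-item h2 > a")))
```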

Get the link URL to the next page

Then look for the link URL to the next page of the article list page.

Looking at the `nav.navigation.pagination` element, you can see that this is the area that shows the links to each page and the link to the next page. The `a` tag with the `next` and `page-numbers` classes inside this element holds the link URL to the next page.


The CSS path to get this would be `nav.navigation.pagination a.next.page-numbers`. Let's check from the browser console how many elements it actually matches.

document.querySelectorAll("nav.navigation.pagination a.next.page-numbers").length

When I executed it, I got the result 1, so it seems safe to use this to get the target link URL.


Also, you can see that the element of the link to the next page is not displayed on the second page, which is the final page.


Just in case, I searched for the next-page link element from the console on that page and got the result 0.

Article detail page

I want to extract the following from the article detail page.

- Title
- Publication date
- Category
- Body text

As before, look at the HTML structure to find the CSS path to the desired element.


Since the procedure is the same as before, I will omit the details, but I found that the data should be extracted from the following elements (a quick check of these selectors follows the list).

- Title: `h1`
- Publication date: the text directly under `article header div.entry-meta` (with the `|` separator removed)
- Category: the `a` tag under `article header div.entry-meta`
- Body text: `article div.entry-content`
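
Before writing the full crawler, it can be reassuring to verify that these selectors actually return the values we expect. Below is a minimal sketch using requests and Beautiful Soup (assuming both are installed) that fetches the first article found on the list page and prints its title, category, and the beginning of the body; the publication date needs a little extra handling and is shown in the full implementation in section 3.

```python
# A quick sanity check of the detail page selectors, assuming requests and
# beautifulsoup4 are installed. This is only a sketch; the actual parser is
# implemented in section 3.
import requests
from bs4 import BeautifulSoup

# Take the first article detail URL from the list page
list_html = requests.get("https://note.lapras.com/").text
list_soup = BeautifulSoup(list_html, "html.parser")
first_article_url = list_soup.select("#main div.post-item h2 > a")[0]["href"]

# Fetch that detail page and print what each selector extracts
detail_soup = BeautifulSoup(requests.get(first_article_url).text, "html.parser")
print(detail_soup.select_one("h1").get_text())
print(detail_soup.select_one("article header div.entry-meta a").get_text())
print(detail_soup.select_one("article div.entry-content").get_text(strip=True)[:100])
```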

3. Implement the crawler

The crawl logic is almost clear from the investigation so far, so let's implement it in code. In most cases it doesn't matter which language you implement it in, but here I'd like to write an example implementation in Python.

Organize what to do

First of all, let's simply enumerate what needs to be done as comments.

# TODO: Access https://note.lapras.com/

# TODO: Get the article detail URLs from the response HTML
    # TODO: If there is a link to the next page, follow it and repeat

# TODO: Access the article detail page

# TODO: Get article information from the response HTML
    # TODO: URL
    # TODO: Title
    # TODO: Publication date
    # TODO: Category
    # TODO: Body text

# Save the retrieved data to a file in JSON format

Precautions when implementing crawlers

As a precaution during implementation, insert sleeps as appropriate to adjust the access interval so that you do not put an excessive load on the crawled service. As a rough guide, it is better to stay within about 1 request per second at most. It would also be a problem if the target service went down because of the crawl, so always check whether the response from the crawl target is an error.
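
As an illustration, a small helper like the sketch below (the name `polite_get` is just for this example and is not used in the implementation later) combines both precautions, the access interval and the error check, in one place:

```python
import time

import requests


def polite_get(url, wait_seconds=1.0):
    """Fetch a URL, fail fast on HTTP errors, and wait before the next request.

    A minimal sketch; wait_seconds should be tuned so that the crawl stays at
    around 1 request per second or less.
    """
    response = requests.get(url)
    # Raise an exception immediately if the crawl target returns an error
    response.raise_for_status()
    # Always wait between requests so the target service is not overloaded
    time.sleep(wait_seconds)
    return response
```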

A straightforward implementation in Python

For Python, I often write crawlers with a combination of requests and Beautiful Soup. For how to use the libraries, please also refer to Beautiful Soup in 10 minutes.
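
As a very small illustration of how Beautiful Soup is used (the HTML below is a made-up fragment mimicking the list page structure, not the real page), it parses HTML text and lets you pull out elements with CSS selectors, which is the same pattern used in the crawler below:

```python
# A minimal Beautiful Soup example with a hypothetical HTML fragment.
from bs4 import BeautifulSoup

html = """
<div class="post-item">
  <h2><a href="https://note.lapras.com/example/">Example article</a></h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
link = soup.select_one("div.post-item h2 > a")
print(link["href"])     # => https://note.lapras.com/example/
print(link.get_text())  # => Example article
```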

There are frameworks for implementing crawlers, such as Scrapy, but to get a complete picture of what a crawler does, I recommend implementing one without a framework first.

If you write out the process described above in straightforward code, without thinking too deeply about the design, it looks like this.

import json
import time

import requests
from bs4 import BeautifulSoup


def parse_article_list_page(html):
    """
Parse the article list page and extract data
    :param html:
    :return:
    """
    soup = BeautifulSoup(html, 'html.parser')
    next_page_link = soup.select_one("nav.navigation.pagination a.next.page-numbers")

    return {
        "article_url_list": [a["href"] for a in soup.select("#main div.post-item h2 > a")],
        "next_page_link": next_page_link["href"] if next_page_link else None
    }


def crawl_article_list_page(start_url):
    """
Crawl the article list page to get all the URLs of the article details
    :return:
    """
    print(f"Accessing to {start_url}...")
    # Access https://note.lapras.com/
    response = requests.get(start_url)
    response.raise_for_status()
    time.sleep(10)

    # Get the article detail URLs from the response HTML
    page_data = parse_article_list_page(response.text)
    article_url_list = page_data["article_url_list"]

    # Follow the link to the next page if there is one
    while page_data["next_page_link"]:
        print(f'Accessing to {page_data["next_page_link"]}...')
        response = requests.get(page_data["next_page_link"])
        time.sleep(10)
        page_data = parse_article_list_page(response.text)
        article_url_list += page_data["article_url_list"]

    return article_url_list


def parse_article_detail(html):
    """
Parse the article detail page to extract data
    :param html:
    :return:
    """
    soup = BeautifulSoup(html, 'html.parser')
    return {
        "title": soup.select_one("h1").get_text(),
        "publish_date": soup.select_one("article header div.entry-meta").find(text=True, recursive=False).replace("|", ""),
        "category": soup.select_one("article header div.entry-meta a").get_text(),
        "content": soup.select_one("article div.entry-content").get_text(strip=True)
    }


def crawl_article_detail_page(url):
    """
Crawl the article detail page to get the article data
    :param url:
    :return:
    """
    # Access the article detail page
    print(f"Accessing to {url}...")
    response = requests.get(url)
    response.raise_for_status()

    time.sleep(10)
    # Get article information from the response HTML
    return parse_article_detail(response.text)


def crawl_lapras_note_articles(start_url):
    """
Crawl LAPRAS NOTE to get all article data
    :return:
    """
    article_url_list = crawl_article_list_page(start_url)
    article_list = []
    for article_url in article_url_list:
        article_data = crawl_article_detail_page(article_url)
        article_list.append(article_data)
    return article_list


def collect_lapras_note_articles():
    """
Get all the data of LAPRAS NOTE articles and save it in a file
    :return:
    """
    print("Start crawl LAPRAS NOTE.")
    article_list = crawl_lapras_note_articles("https://note.lapras.com/")

    output_json_path = "./articles.json"
    with open(output_json_path, mode="w") as f:
        print(f"Start output to file. path: {output_json_path}")
        json.dump(article_list, f)
        print("Done output.")

    print("Done crawl LAPRAS NOTE.")


if __name__ == '__main__':
    collect_lapras_note_articles()
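
As a usage example, once the script has finished and articles.json has been written, the collected data can be loaded back and inspected like this (a minimal sketch, assuming the crawl completed successfully):

```python
import json

# Read the file written by collect_lapras_note_articles()
with open("./articles.json") as f:
    articles = json.load(f)

# Each entry has the keys produced by parse_article_detail()
for article in articles:
    print(article["publish_date"], article["title"])
```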

The implemented code is published in this repository: https://github.com/Chanmoro/lapras-note-crawler

It does feel somewhat like throwaway code, but I will wrap up the explanation here for the Basic edition.

Summary

In this article, "How to make a crawler - Basic", I introduced the basic flow of developing a crawler, using the implementation of a LAPRAS NOTE crawler as the theme.

When I develop a crawler, I follow the procedure introduced here. The code given in the implementation example is often still lacking when viewed as a crawler that will be maintained over a long period, but I think code like this is sufficient for simple uses where it is enough to acquire a set of data once rather than continuously keeping it up to date. Personally, if the crawler only needs to be run a few times for research, I sometimes just implement it quickly with a shell script instead of Python.

In real work, I spend much more time on things like data modeling, designing the crawl flow, and handling errors so that the data stays consistent across continuous crawls. I plan to write a few more articles during the LAPRAS output relay period, and in my next article, "How to make a crawler - Advanced", I would like to build on this one and introduce what to be careful about when designing a crawler that will keep being maintained over the long term, so please look forward to it!

Tomorrow, @Nasum will write the article for day 2 of the LAPRAS output relay! Please look forward to that too!
