You have probably seen infinite scrolling on Facebook and Twitter timelines: scroll to the bottom of the page and new content loads automatically.
The reason I decided to crawl an infinite scroll page was that I needed to pull past tweets from Twitter for a school assignment. You might say that Twitter has an official API. Unfortunately, the official Twitter API is not very accommodating: **it is designed so that you cannot get tweets older than a week**. In other words, if you want older tweets, you have to crawl them yourself. And since Twitter search results are displayed with **infinite scroll**, you have to crawl a page that scrolls infinitely.
A crawler basically works by repeating the following:

1. Fetch the HTML of a page
2. Parse the HTML and extract the data you want
3. Find the link to the next page and repeat from step 1
In this way, a large amount of data can be fetched from the web. The problem with crawling infinite scroll pages is that, unlike pages with ordinary paging (links such as "page 1", "page 2", "next page" below the search results), **there is no link to the next set of results in the page's HTML**. This means existing crawler frameworks (such as Scrapy for Python) can't handle them out of the box. In this post I'll show how to crawl such troublesome infinite scroll pages, partly as a memo to myself.
Rather than covering only the theory, I'll explain using a crawler I actually wrote, which pulls past tweets from Twitter, as an example. See the GitHub repository for the source: https://github.com/keitakurita/twitter_past_crawler
By the way, you can install it with:

```
$ pip install twitterpastcrawler
```
So how does infinite scrolling work in the first place? Even with infinite scroll, loading an infinite number of results in advance is impossible in terms of data volume. In other words, infinite scrolling works by **dynamically** adding data each time the user scrolls down. Therefore, for infinite scrolling to work, the page needs to keep track of how far the current results go and fetch the next chunk of data from the server based on that position.
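To make the mechanism concrete, here is a conceptual sketch of that client-server contract. Everything in it (the fake data, the function and parameter names) is a hypothetical stand-in for illustration, not Twitter's real API:

```python
# Conceptual sketch of the infinite-scroll contract. The "server" here is
# a hypothetical stand-in, not a real API.
FAKE_SERVER_DATA = [f"item {i}" for i in range(100)]

def fetch_from_server(position, page_size=20):
    """Return the chunk of results after `position`, plus the next cursor."""
    start = position or 0
    items = FAKE_SERVER_DATA[start:start + page_size]
    return {"items": items, "next_position": start + len(items)}

cursor = None
for _ in range(3):                      # pretend the user scrolled three times
    response = fetch_from_server(cursor)
    print(response["items"][0], "...")  # the page appends these new items
    cursor = response["next_position"]  # remember where we left off
```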
You can analyze how this is actually achieved by looking at what requests Twitter sends behind the scenes. As a test, search for the word qiita. I'm using Chrome, but any browser lets you inspect the network activity behind a page. In Chrome, open View → Developer → Developer Tools and select the Network tab. When you open it, you should see a screen like the one below:
If you scroll down a few times, you'll see a suspicious URL that appears several times in the list of requests:
```
https://twitter.com/i/search/timeline?vertical=default&q=qiita&src=typd&composed_count=0&include_available_features=1&include_entities=1&include_new_items_bar=true&interval=30000&lang=en&latent_count=0&min_position=TWEET-829694142603145216-833144090631942144-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAQAAEIIAAAAYAAAAAAACAAAAAAAgQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgAAAAAAAQAAAAAEAAAAAAAAAAAABAAAAAAAAAAAAAIAAAAAAAAAAAAAaAAAAAAAAAAAAAAAAAAAAAAAAEAIACAIQIAAAgAAAAAAAASAAAAAAAAAAAAAAAAAAAAAA
```
The last parameter, `min_position`, is obviously suspicious. If you download this response, you can see that it is in JSON format. Its contents look like this:
```
focused_refresh_interval: 240000
has_more_items: false
items_html: ...
max_position: "TWEET-829694142603145216-833155909996077056-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAQAAEIIAAAAYAAAAAAACAAAAAAAgQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgAAAAAAAQAAAAAEAAAAAAAAAAAABAAAAAAAAAAAAAIAAAAAAAAAAAAAaAAAAAAAAAAAAAAAAAAAAAAAAEAIACAIQIAAAgAAAAAAAASAAAAAAAAAAAAAAAAAAAAAA"
```
`items_html` contains the raw HTML of the tweets. This is the tweet content you are looking for. Also of note is the parameter `max_position`: it has the same format as the `min_position` parameter we saw earlier. If you replace `min_position` in the URL with this value and send the request again, you get another response in the same format. In other words, `min_position` is the key parameter we were looking for.
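To check this by hand, here is a minimal sketch using `requests`. The parameters are taken from the captured URL above; the endpoint is undocumented, so it may require extra headers or change without notice:

```python
import requests

# Reproduce the request observed in the developer tools. Everything here is
# inferred from the captured URL; the endpoint is undocumented and may
# change or require additional headers at any time.
params = {
    "vertical": "default",
    "q": "qiita",
    "src": "typd",
    "include_available_features": "1",
    "include_entities": "1",
    "lang": "en",
    "min_position": "TWEET-...",  # cursor copied from a previous response
}
data = requests.get("https://twitter.com/i/search/timeline", params=params).json()

print(data["has_more_items"])
print(data["max_position"])      # feed this back in as the next min_position
print(data["items_html"][:200])  # raw HTML of the fetched tweets
```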
At this point, the rest is easy. In principle, you can crawl by repeating the following process (a minimal sketch follows the list):
1. Send a request to `https://twitter.com/i/search/timeline?...` with the current `min_position`
2. Extract `items_html` and `max_position` from the JSON response
3. Parse the tweets out of `items_html`
4. Set `max_position` as the new `min_position` and send the request again
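Putting that loop together, a sketch might look like the following. It rests on the same assumptions as above (undocumented endpoint), and the `data-tweet-id` attribute used for extraction is an assumption about the markup inside `items_html`, so it is fragile:

```python
import re
import requests

SEARCH_URL = "https://twitter.com/i/search/timeline"

def crawl(query, pages=5):
    """Yield tweet IDs from up to `pages` chunks of search results."""
    min_position = None
    for _ in range(pages):
        params = {"q": query, "src": "typd"}
        if min_position:
            params["min_position"] = min_position
        data = requests.get(SEARCH_URL, params=params).json()
        # Crude extraction: pull tweet IDs out of the raw HTML. A real
        # crawler would parse items_html properly (e.g. with BeautifulSoup).
        yield from re.findall(r'data-tweet-id="(\d+)"', data["items_html"])
        if not data.get("has_more_items"):
            break
        min_position = data["max_position"]  # cursor for the next request

for tweet_id in crawl("qiita"):
    print(tweet_id)
```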
In the package I created, just by giving it a query, the process above runs automatically and the tweet information is written out to a CSV file, as shown below.
sample.py

```python
import twitterpastcrawler

crawler = twitterpastcrawler.TwitterCrawler(
    query="qiita",           # search for tweets that contain the keyword qiita
    output_file="qiita.csv"  # output tweet information to a file called qiita.csv
)
crawler.crawl()              # start crawling
```
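Assuming you save the snippet above as sample.py, running it is just:

```
$ python sample.py
```

When it finishes, the collected tweet information is in qiita.csv.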
If you can get past tweets from Twitter, you can find out what kinds of tweets were made around a particular event (for example, an election or a game's release date), which I think is interesting. And since the number of pages that use infinite scrolling is growing, I expect crawling infinite scroll pages to become useful in more and more situations.