You have probably seen infinite scrolling on Facebook and Twitter timelines: scroll to the bottom of the page and new content loads automatically.
The reason I decided to crawl an infinite scroll page was that I needed to pull past tweets from Twitter for a school assignment. You might say that Twitter has an official API. Unfortunately, the official Twitter API is not very accommodating: **it is designed so that you cannot get tweets older than a week**. In other words, if you want older tweets, you have to crawl them yourself. And since Twitter search results are displayed with **infinite scroll**, you have to crawl a page that scrolls infinitely.
A crawler basically works by repeating the following:

1. Fetch the HTML of a page
2. Parse the HTML and extract the data you want
3. Find the link to the next page and repeat from step 1
In this way, a large amount of data can be fetched from the web. The problem with crawling infinite scroll pages is that, unlike pages with ordinary paging (links such as "page 1", "page 2", "next page" below the search results), **there is no link to the next set of results in the page's HTML**. This means existing crawler frameworks (such as Scrapy for Python) can't handle them out of the box. In this post I'll show how to crawl such troublesome infinite scroll pages, partly as a memo to myself.
Rather than covering only the theory, I'll explain using a crawler I actually wrote, which pulls past tweets from Twitter, as an example. See the GitHub repository for the source: https://github.com/keitakurita/twitter_past_crawler
By the way, you can install it with:

```
$ pip install twitterpastcrawler
```
So how does infinite scrolling work in the first place? Even with infinite scroll, loading an infinite number of results in advance is impossible in terms of data volume. In other words, infinite scrolling works by **dynamically** adding data each time the user scrolls down. Therefore, for infinite scrolling to work, the page needs to keep track of how far the current results go and fetch the next chunk of data from the server based on that position.
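To make the mechanism concrete, here is a conceptual sketch of that client-server contract. Everything in it (the fake data, the function and parameter names) is a hypothetical stand-in for illustration, not Twitter's real API:

```python
# Conceptual sketch of the infinite-scroll contract. The "server" here is
# a hypothetical stand-in, not a real API.
FAKE_SERVER_DATA = [f"item {i}" for i in range(100)]

def fetch_from_server(position, page_size=20):
    """Return the chunk of results after `position`, plus the next cursor."""
    start = position or 0
    items = FAKE_SERVER_DATA[start:start + page_size]
    return {"items": items, "next_position": start + len(items)}

cursor = None
for _ in range(3):                      # pretend the user scrolled three times
    response = fetch_from_server(cursor)
    print(response["items"][0], "...")  # the page appends these new items
    cursor = response["next_position"]  # remember where we left off
```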
You can analyze how this is actually achieved by looking at what requests Twitter sends behind the scenes. As a test, search for the word qiita. I'm using Chrome, but any browser lets you inspect the network activity behind a page. In Chrome, open View → Developer → Developer Tools and select the Network tab. When you open it, you should see a screen like the one below:
If you scroll down a few times, you'll see a suspicious URL that appears several times in the list of requests:
```
https://twitter.com/i/search/timeline?vertical=default&q=qiita&src=typd&composed_count=0&include_available_features=1&include_entities=1&include_new_items_bar=true&interval=30000&lang=en&latent_count=0&min_position=TWEET-829694142603145216-833144090631942144-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAQAAEIIAAAAYAAAAAAACAAAAAAAgQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgAAAAAAAQAAAAAEAAAAAAAAAAAABAAAAAAAAAAAAAIAAAAAAAAAAAAAaAAAAAAAAAAAAAAAAAAAAAAAAEAIACAIQIAAAgAAAAAAAASAAAAAAAAAAAAAAAAAAAAAA
```
The last parameter, `min_position`, is obviously suspicious. If you download this response, you can see that it is in JSON format. Its contents look like this:
```
focused_refresh_interval: 240000
has_more_items: false
items_html: ...
max_position: "TWEET-829694142603145216-833155909996077056-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAQAAEIIAAAAYAAAAAAACAAAAAAAgQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgAAAAAAAQAAAAAEAAAAAAAAAAAABAAAAAAAAAAAAAIAAAAAAAAAAAAAaAAAAAAAAAAAAAAAAAAAAAAAAEAIACAIQIAAAgAAAAAAAASAAAAAAAAAAAAAAAAAAAAAA"
```
`items_html` contains the raw HTML of the tweets. This is the tweet content you are looking for. Also of note is the parameter `max_position`: it has the same format as the `min_position` parameter we saw earlier. If you replace `min_position` in the URL with this value and send the request again, you get another response in the same format. In other words, `min_position` is the key parameter we were looking for.
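To check this by hand, here is a minimal sketch using `requests`. The parameters are taken from the captured URL above; the endpoint is undocumented, so it may require extra headers or change without notice:

```python
import requests

# Reproduce the request observed in the developer tools. Everything here is
# inferred from the captured URL; the endpoint is undocumented and may
# change or require additional headers at any time.
params = {
    "vertical": "default",
    "q": "qiita",
    "src": "typd",
    "include_available_features": "1",
    "include_entities": "1",
    "lang": "en",
    "min_position": "TWEET-...",  # cursor copied from a previous response
}
data = requests.get("https://twitter.com/i/search/timeline", params=params).json()

print(data["has_more_items"])
print(data["max_position"])      # feed this back in as the next min_position
print(data["items_html"][:200])  # raw HTML of the fetched tweets
```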
At this point, the rest is easy. In principle, you can crawl by repeating the following process (a minimal sketch follows the list):
1. Send a request to `https://twitter.com/i/search/timeline?...` with the current `min_position`
2. Extract `items_html` and `max_position` from the JSON response
3. Parse the tweets out of `items_html`
4. Set `max_position` as the new `min_position` and send the request again
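Putting that loop together, a sketch might look like the following. It rests on the same assumptions as above (undocumented endpoint), and the `data-tweet-id` attribute used for extraction is an assumption about the markup inside `items_html`, so it is fragile:

```python
import re
import requests

SEARCH_URL = "https://twitter.com/i/search/timeline"

def crawl(query, pages=5):
    """Yield tweet IDs from up to `pages` chunks of search results."""
    min_position = None
    for _ in range(pages):
        params = {"q": query, "src": "typd"}
        if min_position:
            params["min_position"] = min_position
        data = requests.get(SEARCH_URL, params=params).json()
        # Crude extraction: pull tweet IDs out of the raw HTML. A real
        # crawler would parse items_html properly (e.g. with BeautifulSoup).
        yield from re.findall(r'data-tweet-id="(\d+)"', data["items_html"])
        if not data.get("has_more_items"):
            break
        min_position = data["max_position"]  # cursor for the next request

for tweet_id in crawl("qiita"):
    print(tweet_id)
```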
In the package I created, just by giving it a query, the process above runs automatically and the tweet information is written out to a CSV file, as shown below.
sample.py

```python
import twitterpastcrawler

crawler = twitterpastcrawler.TwitterCrawler(
    query="qiita",           # search for tweets that contain the keyword qiita
    output_file="qiita.csv"  # output tweet information to a file called qiita.csv
)
crawler.crawl()              # start crawling
```
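Assuming you save the snippet above as sample.py, running it is just:

```
$ python sample.py
```

When it finishes, the collected tweet information is in qiita.csv.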
If you can get past tweets from Twitter, you can find out what kinds of tweets were made around a particular event (for example, an election or a game's release date), which I think is interesting. And since the number of pages that use infinite scrolling is growing, I expect crawling infinite scroll pages to become useful in more and more situations.