[First post] I wanted to fetch Qiita articles by tag, so I implemented it in Python. I started on this because I had used the Livedoor News Corpus to classify article categories with machine learning, and it was suggested that I try the same with Qiita articles. The code may be a little hard to follow in places; if so, please let me know in the comments.
The Qiita API is a web API provided by Qiita that lets you retrieve various data and post articles. https://qiita.com/api/v2/docs
There is an upper limit on how many articles can be fetched: page can be at most 100 and per_page (the number of articles returned per page) can be at most 100, so at most 10,000 articles can be retrieved.
Note, however, that user authentication is needed because of the rate limit:
Authenticated requests are accepted up to 1,000 times per hour per user; unauthenticated requests are accepted up to 60 times per hour per IP address. (From the official Qiita API documentation)
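To make the rate limits concrete, here is a minimal sketch that turns an hourly quota into the smallest delay between requests. The function name and constants are made up for this illustration, and evenly spacing requests is only one possible way to stay within the quota.

```python
def min_interval_seconds(requests_per_hour: int) -> float:
    """Smallest delay between requests that stays within an hourly quota."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return 3600 / requests_per_hour


# Quotas quoted from the Qiita API documentation above
AUTHENTICATED_LIMIT = 1000   # per user per hour
UNAUTHENTICATED_LIMIT = 60   # per IP address per hour

print(min_interval_seconds(AUTHENTICATED_LIMIT))    # 3.6
print(min_interval_seconds(UNAUTHENTICATED_LIMIT))  # 60.0
```

Passing the result to time.sleep() between requests would keep a long-running loop under the quota.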
This time I want 900 articles in total, so I fetch 100 articles at a time and repeat that 9 times.
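The limits above can be checked with a little arithmetic before sending any request. The following is only a sketch: pages_needed is a hypothetical helper, and the two constants simply restate the page and per_page limits mentioned above.

```python
import math

MAX_PAGE = 100       # upper limit of the page parameter
MAX_PER_PAGE = 100   # upper limit of the per_page parameter


def pages_needed(total_articles: int, per_page: int) -> int:
    """Number of pages to request, or ValueError if the limits are exceeded."""
    if not 1 <= per_page <= MAX_PER_PAGE:
        raise ValueError("per_page must be between 1 and 100")
    pages = math.ceil(total_articles / per_page)
    if pages > MAX_PAGE:
        raise ValueError("this would need more than 100 pages")
    return pages


print(MAX_PAGE * MAX_PER_PAGE)   # 10000 articles at most
print(pages_needed(900, 100))    # 9
print(pages_needed(100, 1))      # 100
```

Note that pages_needed(900, 1) raises ValueError, because 900 articles at one article per page would need pages beyond the limit of 100.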
First, get the access token required for user authentication.
・ Select "Application" from "Settings"
・ "Personal access token" → "Issue a new token"
・ This time, check only read_qiita and click "Issue"
・ A token is issued, so copy it.
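Once the token is copied, it is safer not to paste it directly into the script. As a rough sketch, the header can be built from an environment variable instead. QIITA_TOKEN and build_auth_headers are hypothetical names chosen for this example and are not part of the Qiita API.

```python
import os


def build_auth_headers(token: str) -> dict:
    """Build the Authorization header from a personal access token."""
    token = token.strip()
    if not token:
        raise ValueError("access token is empty")
    return {"Authorization": "Bearer " + token}


# QIITA_TOKEN is a hypothetical environment variable name for this sketch.
token = os.environ.get("QIITA_TOKEN", "")
if token:
    h = build_auth_headers(token)
    print("Authorization header is ready")
else:
    print("Set QIITA_TOKEN to the copied access token first")
```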
```python
import http.client
import io
import os

import pandas as pd

# Header required for user authentication
h = {'Authorization': 'Bearer [Obtained access token]'}
connect = http.client.HTTPSConnection("qiita.com")
url = "/api/v2/items?"

# Specify the tag you want to get
tag_name = "Java"
query = "query=tag%3A" + tag_name

# Get the number of articles that match the search
connect.request("GET", url + query, headers=h)
# Response to the request
res = connect.getresponse()
# Read the response body so the connection can be reused
res.read()
# Status returned by the server
print(res.status, res.reason)
total_count = int(res.headers['Total-Count'])
print("total_count: " + str(total_count))

# Make sure the output directory exists
out_dir = "./qiita/" + tag_name + "/page/"
os.makedirs(out_dir, exist_ok=True)

# Get the data and write 100 articles to txt files
for pg in range(1, 101):
    page = "page=" + str(pg) + "&per_page=1"
    connect.request("GET", url + page + "&" + query, headers=h)
    res = connect.getresponse()
    data = res.read().decode("utf-8")
    # Load the JSON data into a pandas DataFrame
    df = pd.read_json(io.StringIO(data))
    # Output txt file
    filename = out_dir + str(pg) + ".txt"
    # Write the title and body of the Qiita article
    df.to_csv(filename, columns=[
        'title',
        'body',
    ], header=False, index=False)
    print(str(pg) + "/" + str(100) + " Done")
```
For user authentication, the request header must specify the token, in the form {'Authorization': 'Bearer [Obtained access token]'}.
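The request path in the code is put together by string concatenation, with the colon of tag: encoded by hand as %3A. As an alternative sketch, the standard library's urllib.parse.urlencode can do the encoding, which also covers tag names with special characters such as C++. build_items_path is a hypothetical helper name used only here.

```python
from urllib.parse import urlencode


def build_items_path(tag_name: str, page: int, per_page: int) -> str:
    """Build the request path for the items endpoint with an encoded query."""
    params = {
        "page": page,
        "per_page": per_page,
        "query": "tag:" + tag_name,
    }
    return "/api/v2/items?" + urlencode(params)


print(build_items_path("Java", 1, 100))
# /api/v2/items?page=1&per_page=100&query=tag%3AJava
print(build_items_path("C++", 2, 50))
# /api/v2/items?page=2&per_page=50&query=tag%3AC%2B%2B
```

The returned string can be passed as the second argument of connect.request("GET", ...).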
## Get json file
The Qiita API returns post data as JSON.
https://qiita.com/api/v2/docs#%E6%8A%95%E7%A8%BF
To read it, I use the read_json function of the pandas library to convert it into a pandas DataFrame.
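To show what this conversion looks like without calling the API, here is a small sketch on a made-up JSON string. The sample data is invented for illustration, and a real response carries many more fields per article.

```python
import io

import pandas as pd

# Made-up sample shaped like a list of articles (not real API output)
sample = (
    '[{"title": "Hello", "body": "First body", "likes_count": 3},'
    ' {"title": "Second", "body": "Other body", "likes_count": 0}]'
)

# Wrapping the text in StringIO lets read_json treat it like a file
df = pd.read_json(io.StringIO(sample))
print(df.shape)

# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = df.to_csv(columns=["title", "body"], header=False, index=False)
print(csv_text)
```

Each JSON object becomes one row, and each key becomes a column, so the columns argument of to_csv selects which fields are written.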
# Code to get 900 articles with the specified tag
Here is the whole code.
```python
# Library imports
import http.client
import io
import os

import pandas as pd

# Number of articles you want to get
TOTAL_PAGE = 900
# page is capped at 100, so requesting pages 1-100 with per_page=1 nine times
# would return the same 100 articles each time; fetch 100 per request instead
PER_PAGE = 100
# Number of requests (pages) needed
TIME = TOTAL_PAGE // PER_PAGE

# User authentication
h = {'Authorization': 'Bearer [Obtained access token]'}
connect = http.client.HTTPSConnection("qiita.com")
url = "/api/v2/items?"

# Tag to specify
tag_name = "Java"
query = "query=tag%3A" + tag_name

# Get the number of articles that match the search
connect.request("GET", url + query, headers=h)
# Response to the request
res = connect.getresponse()
# Read the response body so the connection can be reused
res.read()
# Status returned by the server
print(res.status, res.reason)
print("Specified tag: " + tag_name)
total_count = int(res.headers['Total-Count'])
print("total_count: " + str(total_count))

# Make sure the output directory exists
out_dir = "./qiita/" + tag_name + "/"
os.makedirs(out_dir, exist_ok=True)

# Get the data and write 900 articles to txt files
for count in range(1, TIME + 1):
    page = "page=" + str(count) + "&per_page=" + str(PER_PAGE)
    connect.request("GET", url + page + "&" + query, headers=h)
    res = connect.getresponse()
    data = res.read().decode("utf-8")
    df = pd.read_json(io.StringIO(data))
    # Write each article on this page to its own file
    for pg in range(1, len(df) + 1):
        filename = out_dir + "page" + str(count) + "-" + str(pg) + ".txt"
        df.iloc[[pg - 1]].to_csv(filename, columns=[
            'title',
            'body',
        ], header=False, index=False)
        print(str(count) + ":" + str(pg) + "/" + str(len(df)) + " Done")
```
The code is a little hard to follow, but this got me 900 articles.
This time I fetched only the title and body of each article, but you can also get the number of likes, the update date, and so on. If you want other fields, please refer to the official Qiita API documentation. Give it a try!
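As a rough illustration of picking other fields, the sketch below works on a made-up JSON string. It assumes the field names likes_count and updated_at for the number of likes and the update date, so check them against the official documentation. pick_fields is a hypothetical helper.

```python
import json

# Made-up sample; likes_count and updated_at are the field names assumed here
sample = """[
  {"title": "Hello", "body": "First body", "likes_count": 3,
   "updated_at": "2020-01-01T09:00:00+09:00"},
  {"title": "Second", "body": "Other body", "likes_count": 0,
   "updated_at": "2020-01-02T09:00:00+09:00"}
]"""


def pick_fields(items, fields):
    """Keep only the requested fields of each article, in the given order."""
    return [[item[name] for name in fields] for item in items]


rows = pick_fields(json.loads(sample), ["title", "likes_count", "updated_at"])
for row in rows:
    print(row)
```

In the pandas version of this article, the equivalent change would be adding those names to the columns list passed to to_csv.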
・ Qiita API official https://qiita.com/api/v2/docs
・ Get Qiita article information with API and write it to CSV https://qiita.com/arai-qiita/items/94902fc0e686e59cb8c5