Introduction

[First post] I wanted to retrieve Qiita articles by tag, so I implemented it in Python. I started on this because I had used the Livedoor News Corpus to classify article categories with machine learning, and I was advised to try the same thing with Qiita articles. The code may be a little hard to follow in places; if so, please let me know in the comments.

About Qiita API

The Qiita API is a web API provided by Qiita that lets you retrieve various data and post articles. https://qiita.com/api/v2/docs

There is an upper limit on how many articles can be retrieved: page can be at most 100, and per_page (the number of articles returned per page) can be at most 100, so at most 10,000 articles can be retrieved for a single query.

Note, however, that user authentication is needed in practice because of the rate limits.

The API accepts up to 1,000 requests per hour per user when authenticated, and up to 60 requests per hour per IP address when unauthenticated. (From the official Qiita API documentation)
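
One simple way to stay under these limits is to space requests evenly over the hour. The following is only a minimal sketch under that assumption; the API itself does not require even spacing, and the names `min_interval_seconds` and `throttled` are hypothetical helpers, not part of the Qiita API.

```python
import time

AUTH_LIMIT_PER_HOUR = 1000  # authenticated requests per user per hour
ANON_LIMIT_PER_HOUR = 60    # unauthenticated requests per IP address per hour


def min_interval_seconds(limit_per_hour):
    """Return the spacing between requests that uses the hourly limit evenly."""
    return 3600 / limit_per_hour


def throttled(items, limit_per_hour, sleep=time.sleep):
    """Yield items one by one, pausing between them (hypothetical helper)."""
    interval = min_interval_seconds(limit_per_hour)
    for i, item in enumerate(items):
        if i > 0:
            sleep(interval)
        yield item


print(min_interval_seconds(AUTH_LIMIT_PER_HOUR))  # 3.6
print(min_interval_seconds(ANON_LIMIT_PER_HOUR))  # 60.0
```

Wrapping the page numbers in `throttled(range(1, 101), AUTH_LIMIT_PER_HOUR)` would pause 3.6 seconds between requests, which is more conservative than necessary when fewer than 1,000 requests are sent in total.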

This time I want a total of 900 articles, so I fetch 100 articles at a time and repeat that 9 times.
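
The arithmetic behind these numbers can be checked with a few lines of Python. This is a sketch based only on the two limits quoted above (page up to 100, per_page up to 100); `pages_needed` is a hypothetical helper name.

```python
import math

MAX_PAGE = 100      # upper limit of the page parameter
MAX_PER_PAGE = 100  # upper limit of the per_page parameter


def pages_needed(total_articles, per_page=MAX_PER_PAGE):
    """Return how many pages must be requested to cover total_articles."""
    if not 1 <= per_page <= MAX_PER_PAGE:
        raise ValueError("per_page must be between 1 and 100")
    pages = math.ceil(total_articles / per_page)
    if pages > MAX_PAGE:
        raise ValueError("more than 100 pages would be needed")
    return pages


print(MAX_PAGE * MAX_PER_PAGE)  # 10000
print(pages_needed(900))        # 9
```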

How to get an access token for the Qiita API

First, get the access token required for user authentication.

・ Select "Application" from "Settings"

・ "Personal access token" → "Issue a new token"

・ This time, check only read_qiita and click "Issue"

・ A token will be issued, so copy it.

Code example for user authentication of Qiita API

import http.client

# Header required for user authentication
h = {'Authorization': 'Bearer [Obtained access token]'}
connect = http.client.HTTPSConnection("qiita.com")
url = "/api/v2/items?"

Code example to get the article

import io
import os
import pandas as pd

# Specify the tag you want to get
tag_name = "Java"
query = "&query=tag%3A" + tag_name
# Get the number of articles that have the specified tag
connect.request("GET", url + query, headers=h)
# Response to request
res = connect.getresponse()
# Read the response body (needed before the connection can be reused)
res.read()
# Status returned by the server
print(res.status, res.reason)
total_count = int(res.headers['Total-Count'])
print("total_count: " + str(total_count))
# Create the output directory if it does not exist
out_dir = "./qiita/" + tag_name + "/page/"
os.makedirs(out_dir, exist_ok=True)
# Get data and write 100 articles to txt files
for pg in range(1, 101):
    page = "page=" + str(pg) + "&per_page=1"
    connect.request("GET", url + page + query, headers=h)
    res = connect.getresponse()
    data = res.read().decode("utf-8")
    # Load the JSON data into a pandas DataFrame
    df = pd.read_json(io.StringIO(data))
    # Specify the txt file
    filename = out_dir + str(pg) + ".txt"
    # Write the title and body of the Qiita article
    df.to_csv(filename, columns=['title', 'body'], header=False, index=False)
    print(str(pg) + "/100 Done")
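
The query string above is assembled by hand, with the colon in tag: already written as %3A. As an alternative sketch, the standard library's `urllib.parse.urlencode` can do the escaping, which may help with tags such as C# that contain characters needing encoding. `build_items_path` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_items_path(tag_name, page, per_page):
    """Build the request path for /api/v2/items with a tag search query."""
    params = {
        "page": page,
        "per_page": per_page,
        "query": "tag:" + tag_name,
    }
    return "/api/v2/items?" + urlencode(params)


print(build_items_path("Java", 1, 1))
# /api/v2/items?page=1&per_page=1&query=tag%3AJava
```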

Explanation of the above code

User authentication

For user authentication, the access token must be specified in the Authorization request header, like this:

`{'Authorization': 'Bearer [Obtained access token]'}`
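
To avoid pasting the token directly into the source, one option is to read it from an environment variable. This is only a sketch: `QIITA_TOKEN` is a hypothetical variable name chosen for this example, not something the Qiita API defines, and `build_auth_header` is a hypothetical helper.

```python
import os


def build_auth_header(token):
    """Return the Authorization header dict for a Qiita API request."""
    token = token.strip()
    if not token:
        raise ValueError("access token is empty")
    return {"Authorization": "Bearer " + token}


# QIITA_TOKEN is an assumed environment variable name; the placeholder
# string is used when the variable is not set
h = build_auth_header(os.environ.get("QIITA_TOKEN", "[Obtained access token]"))
```

The resulting `h` can be passed as `headers=h` in the same way as in the code above.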

## Get JSON data
The Qiita API returns posted articles as JSON data.
https://qiita.com/api/v2/docs#%E6%8A%95%E7%A8%BF

When retrieving it, I use the read_json function of the pandas library to convert it to a pandas DataFrame.
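
To make the shape of the data concrete, the sketch below parses a hand-written sample with the standard `json` module instead of pandas. The sample is an assumption for illustration: it contains only the title and body fields with made-up values, while a real response has many more fields.

```python
import json

# Hand-written sample shaped like a response body (values are made up)
sample = '[{"title": "Sample title", "body": "Sample body"}]'


def extract_title_and_body(data):
    """Return a list of (title, body) tuples from a JSON array of articles."""
    articles = json.loads(data)
    return [(a["title"], a["body"]) for a in articles]


print(extract_title_and_body(sample))
# [('Sample title', 'Sample body')]
```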

# Code to get 900 articles with the specified tag
Here is the whole code.

```python
# Library imports
import http.client
import io
import os
import pandas as pd

# Total number of articles you want to get
TOTAL_ARTICLES = 900
# Articles per request (the API allows at most 100)
PER_PAGE = 100
# Number of requests needed
TIMES = TOTAL_ARTICLES // PER_PAGE

# User authentication
h = {'Authorization': 'Bearer [Obtained access token]'}
connect = http.client.HTTPSConnection("qiita.com")
url = "/api/v2/items?"

# Tag to specify
tag_name = "Java"
query = "&query=tag%3A" + tag_name

# Get the number of articles that have the specified tag
connect.request("GET", url + query, headers=h)
# Response to request
res = connect.getresponse()
# Read the response body (needed before the connection can be reused)
res.read()
# Status returned by the server
print(res.status, res.reason)
print("Specified tag: " + tag_name)
total_count = int(res.headers['Total-Count'])
print("total_count: " + str(total_count))

# Create the output directory if it does not exist
out_dir = "./qiita/" + tag_name
os.makedirs(out_dir, exist_ok=True)

# Get 100 articles per request and write each article to its own txt file
for count in range(1, TIMES + 1):
    # page selects which 100 articles are returned, so it advances every request
    page = "page=" + str(count) + "&per_page=" + str(PER_PAGE)
    connect.request("GET", url + page + query, headers=h)
    res = connect.getresponse()
    data = res.read().decode("utf-8")
    df = pd.read_json(io.StringIO(data))
    for pg in range(1, len(df) + 1):
        filename = out_dir + "/page" + str(count) + "-" + str(pg) + ".txt"
        # Write the title and body of one article
        df.iloc[[pg - 1]].to_csv(filename, columns=['title', 'body'],
                                 header=False, index=False)
        print(str(count) + ":" + str(pg) + "/" + str(len(df)) + " Done")
```

Result

The output is hard to read, but I got 900 articles.

Summary

This time I retrieved only the title and body of each article, but other items such as the "number of likes" and "update date" can also be retrieved. If you want other items, please refer to the official Qiita API documentation. Please give it a try!
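
As a rough sketch of picking other items, the example below writes the title together with the number of likes and the update date to CSV text using only the standard library. It assumes the field names `likes_count` and `updated_at`, which is how I understand the item schema; please check them against the official documentation. The sample data is made up, and `articles_to_csv` is a hypothetical helper.

```python
import csv
import io
import json

# Assumed field names; verify against the official Qiita API documentation
FIELDS = ["title", "likes_count", "updated_at"]


def articles_to_csv(data, fields=FIELDS):
    """Convert a JSON array of articles to CSV text with only the given fields."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for article in json.loads(data):
        # Missing fields are written as empty strings
        writer.writerow([article.get(f, "") for f in fields])
    return out.getvalue()


# Made-up sample shaped like one article
sample = '[{"title": "Sample title", "likes_count": 3, "updated_at": "2020-02-01T19:00:00+09:00"}]'
print(articles_to_csv(sample))
```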

Reference material

・ Official Qiita API documentation https://qiita.com/api/v2/docs

・ Get Qiita article information with API and write it to CSV https://qiita.com/arai-qiita/items/94902fc0e686e59cb8c5

How to get article data using Qiita API