Preparing your own sample data for text mining

Overview

I recently started studying text mining, but I struggled because I couldn't find good sample data. Others may be having the same trouble, so this article records my trial and error in preparing sample data myself.

At work I often analyze **free-text answers (impressions, requests, etc.)** from questionnaires, so my goal is to obtain data in as similar a format as possible.

Options considered

Aozora Bunko

Aozora Bunko, which often appears in text mining books, is a website that hosts literary works whose copyright has expired. However, the texts are nothing like questionnaire data, so I will not use it this time.

Merits:
- You can easily get a fair amount of text.
- Works you have already read are easy to analyze because you have prior knowledge.

Demerits:
- The format is completely different from questionnaire data.

Scraping

I also considered scraping sites that collect company reviews, such as OpenWork and Job Change Conference. It looks very interesting, but I judged that the drawbacks were too large and decided to forgo it this time.

Merits:
- If you limit it to your own company, it is easy to analyze with prior knowledge.
- Respondent attributes, such as whether they were hired mid-career or as a new graduate, are rich, which makes it easy to dig deeper.

Demerits:
- Scraping is somewhat difficult; for example, you have to log in to read the posts.
- If scraping violates the site's terms of use, I, as the author of this article, could get into trouble.

Twitter API

If you use an impression-posting campaign run by a company (for example, a Star Wars campaign), you can get data similar to free-text questionnaire answers. However, when I actually looked at the posts, they were full of Twitter-specific quirks, so I will not use this either.

Merits:
- The API is well established, so it is easy to use.

Demerits:
- Some posts have images attached, and many contain URLs and hashtags, so the text is full of Twitter-specific quirks.

EC site API

If you can extract the reviews for a specific product, you will probably get data similar to free-text questionnaire answers. There are various EC sites such as Amazon, Rakuten Ichiba, and Yahoo! Shopping, and with the **Yahoo! Shopping Product Review Search API** you can specify a JAN code[^1] to get reviews. I can't think of any obvious disadvantages, so I will use this API this time.

Merits:
- The API is well established, so it is easy to use.
- If you limit it to reviews of products you are interested in, it is easy to analyze with prior knowledge.

[^1]: For now, you can think of it as a barcode number.
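
Because the API is queried by JAN code, a quick sanity check can catch a mistyped code before any request is sent. The sketch below assumes a 13-digit JAN code, which uses the same check-digit rule as EAN-13. The function name `is_valid_jan13` is made up for this illustration and is not part of the Yahoo! API.

```python
import re


def is_valid_jan13(code: str) -> bool:
    """Return True if `code` is 13 ASCII digits with a correct check digit."""
    if not re.fullmatch(r"[0-9]{13}", code):
        return False
    digits = [int(c) for c in code]
    # Weights alternate 1, 3, 1, 3, ... over the first 12 digits.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    # The 13th digit brings the weighted sum up to a multiple of 10.
    return (10 - total % 10) % 10 == digits[12]


print(is_valid_jan13("4902777323176"))  # the JAN code used in this article
```

This only checks that the number is well formed; it says nothing about whether the product exists on Yahoo! Shopping.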

Hit the Yahoo! Shopping Web API

Get application ID

Obtain an application ID from the Yahoo! developer site (developer.yahoo.co.jp). If you just want to reproduce this article, you can leave the settings untouched (I was able to register with the site URL set to http://example.com/).

Get reviews

From here on, everything runs in Python. Replace the value of `appid` with the application ID you obtained earlier (it may be shown as Client ID on the admin screen). The code below retrieves only the product rating (`rate`) and the review text (`description`) for now, but you can get more information if you like (see the [official documentation](https://developer.yahoo.co.jp/webapi/shopping/shopping/v1/reviewsearch.html) for details). You can get at most 50 reviews per request; if you need more, shift `start` by 50 and repeat.

```python
import pandas as pd
import requests

url = "https://shopping.yahooapis.jp/ShoppingWebService/V1/json/reviewSearch"
payload = {
    "appid": "XXXXXXXXXX",   # replace with your application ID
    "jan": "4902777323176",  # Zabas protein
    "results": 50,           # default: 10, max: 50
    # "start": 1,            # shift by 50 to fetch the next batch
}

# Call the API and pull the list of reviews out of the JSON response.
response = requests.get(url, params=payload)
response.raise_for_status()  # stop early on an HTTP error
res = response.json()["ResultSet"]["Result"]

# Keep only the reviews that actually contain text.
reviews = [x for x in res if x["Description"] != ""]
df = pd.DataFrame({
    "rate": [x["Ratings"]["Rate"] for x in reviews],     # rating
    "description": [x["Description"] for x in reviews],  # review text
})
df.to_csv("review.csv", header=True, index=False)
```
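
The "shift `start` by 50 and repeat" step could be sketched roughly as follows. This is only an illustration under some assumptions: `collect_reviews` and `fake_fetch_page` are hypothetical names, the real request is replaced by an in-memory stand-in so the sketch runs offline, and treating a short page as the end of the data is my guess rather than documented API behavior.

```python
from typing import Callable, Dict, List

Review = Dict[str, object]


def collect_reviews(
    fetch_page: Callable[[int, int], List[Review]],
    page_size: int = 50,
    max_reviews: int = 200,
) -> List[Review]:
    """Call fetch_page(start, results) repeatedly, shifting start by page_size."""
    reviews: List[Review] = []
    start = 1  # the payload in this article shows "start": 1 as the first position
    while len(reviews) < max_reviews:
        page = fetch_page(start, page_size)
        reviews.extend(page)
        if len(page) < page_size:  # assumed end-of-data signal
            break
        start += page_size
    return reviews[:max_reviews]


# Stand-in for the real request: 120 made-up reviews held in memory.
FAKE_REVIEWS = [{"Description": f"review {i}"} for i in range(1, 121)]


def fake_fetch_page(start: int, results: int) -> List[Review]:
    """Return the slice of fake reviews that a page request would cover."""
    return FAKE_REVIEWS[start - 1 : start - 1 + results]


all_reviews = collect_reviews(fake_fetch_page)
print(len(all_reviews))  # 120 with the fake data above
```

With the real API, `fetch_page` would wrap the `requests.get` call from this article, setting `payload["start"]` and `payload["results"]` from its two arguments. The `max_reviews` cap is there so that a mistake in the stop condition cannot turn into an endless stream of requests.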

Verification

Running the code above creates a file called review.csv, which you can then analyze however you like. When I opened it in a spreadsheet for a quick check, I was surprised to find that there seemed to be more reviews about delivery than about the product itself. It would be interesting to compare high-rated and low-rated reviews.
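
As a first step toward that comparison, the reviews could be split into two groups by rating. The sketch below is only an illustration: the sample rows are made up (they are not real reviews), the threshold of 4 is an arbitrary assumption, and `split_by_rating` is a hypothetical helper. It uses only the standard library so that it runs without any extra packages.

```python
import csv
import io

# Made-up rows in the same layout as review.csv (columns: rate, description).
SAMPLE_CSV = """rate,description
5,Arrived quickly and tastes good
4,Delivery was fast
2,The box was dented on arrival
1,Too sweet for me
"""


def split_by_rating(csv_file, threshold=4.0):
    """Return (high, low) lists of review texts, split at `threshold`."""
    high, low = [], []
    for row in csv.DictReader(csv_file):
        # Ratings come back from the CSV as text, so convert before comparing.
        target = high if float(row["rate"]) >= threshold else low
        target.append(row["description"])
    return high, low


high, low = split_by_rating(io.StringIO(SAMPLE_CSV))
print(len(high), "high-rated /", len(low), "low-rated")
```

To run it on the real file, pass `open("review.csv", newline="", encoding="utf-8")` instead of the `io.StringIO` object. Each group can then be fed separately into whatever text mining step comes next.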

Now I can finally start studying text mining...