Preparing your own sample data for text mining

Overview

I recently started studying text mining, but I had a hard time because I couldn't find good sample data. Others may be running into the same problem, so this article records the trial and error I went through to prepare sample data myself.

At work I often analyze **free-text responses (impressions, requests, etc.)** from questionnaires, so my goal is to obtain data in a format as close to that as possible.

Options considered

Aozora Bunko

Aozora Bunko, which often appears in text mining books, is a website that hosts literary works whose copyright has expired. However, literary text is not similar to questionnaire data, so I will not use it this time.

Scraping

Another option is scraping sites that collect company reviews, such as OpenWork and Job Change Conference. The data looks very interesting, but I judged the downsides to be too large and decided against it this time.

Twitter API

If you target an impression-posting campaign run by a company (a Star Wars campaign is one example), you can get data similar to free-text questionnaire responses. However, when I actually looked at the posts, the Twitter-specific quirks were quite strong, so I will not use this either.

EC site API

If you can extract reviews for a specific product, you will probably get data similar to free-text questionnaire responses. There are various EC sites, such as Amazon, Rakuten Ichiba, and Yahoo! Shopping, and the **Yahoo! Shopping Product Review Search API** lets you specify a JAN code [^1] to fetch reviews. I can't think of any obvious disadvantages, so I will use this API this time.

[^1]: For now, you can think of it as a barcode number.
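Since the API is queried by JAN code, it can help to catch a mistyped code before sending a request. The following is a minimal sketch, assuming a 13-digit JAN code that follows the standard EAN-13 check-digit rule; the function name `is_valid_jan13` is hypothetical and not part of the Yahoo! API.

```python
def is_valid_jan13(code):
    """Return True if `code` is a 13-digit string with a correct check digit."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(c) for c in code]
    # Counting positions from the left starting at 1:
    # odd positions have weight 1, even positions have weight 3.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    # The 13th digit must bring the weighted sum up to a multiple of 10.
    return (10 - total % 10) % 10 == digits[12]
```

For example, `is_valid_jan13("4902777323176")` accepts the JAN code used later in this article.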

Calling the Yahoo! Shopping Web API

Get application ID

Obtain an application ID by registering an application on the Yahoo! JAPAN developer site. If you just want to reproduce this article, you don't need to change any of the default settings (I was able to register the site URL as http://example.com/).

Get reviews

From here on, everything runs in Python. Replace the value of `appid` with the application ID you obtained earlier (it may be listed as Client Id on the admin screen). The code below fetches only the product rating (`rate`) and the review text (`description`) for now, but more fields are available if you want them (see the [Official Document](https://developer.yahoo.co.jp/webapi/shopping/shopping/v1/reviewsearch.html) for details). A single request returns at most 50 reviews; if you need more than that, shift `start` by 50 and repeat.

import pandas as pd
import requests

url = "https://shopping.yahooapis.jp/ShoppingWebService/V1/json/reviewSearch"
payload = {
    "appid": "XXXXXXXXXX",   # replace with your application ID
    "jan": "4902777323176",  # Zabas protein
    "results": 50,           # default: 10, max: 50
    # "start": 1,            # position of the first review to return
}

response = requests.get(url, params=payload, timeout=10)
response.raise_for_status()  # stop early on HTTP errors
res = response.json()["ResultSet"]["Result"]

# Keep only reviews that have a non-empty body
reviews = [x for x in res if x["Description"] != ""]
df = pd.DataFrame({
    "rate": [x["Ratings"]["Rate"] for x in reviews],     # product rating
    "description": [x["Description"] for x in reviews],  # review text
})
df.to_csv("review.csv", header=True, index=False)
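
The "shift `start` by 50 and repeat" step can be sketched roughly as follows. This is only an illustration under some assumptions: `start` is taken to be 1-based, a page shorter than 50 items is taken to mean there are no more reviews, and the helper names `fetch_page` and `fetch_all_reviews` are hypothetical. It uses only the standard library so that the paging loop can be tried without extra packages.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://shopping.yahooapis.jp/ShoppingWebService/V1/json/reviewSearch"
PAGE_SIZE = 50  # maximum number of reviews per request, as noted in the article


def fetch_page(appid, jan, start):
    """Fetch one page of reviews beginning at position `start` (assumed 1-based)."""
    query = urllib.parse.urlencode(
        {"appid": appid, "jan": jan, "results": PAGE_SIZE, "start": start}
    )
    with urllib.request.urlopen(f"{API_URL}?{query}", timeout=10) as resp:
        return json.load(resp)["ResultSet"]["Result"]


def fetch_all_reviews(get_page, max_reviews=200):
    """Call get_page(start) repeatedly, shifting start by PAGE_SIZE each time."""
    reviews = []
    start = 1
    while len(reviews) < max_reviews:
        page = get_page(start)
        reviews.extend(page)
        if len(page) < PAGE_SIZE:  # assumption: a short page means no more reviews
            break
        start += PAGE_SIZE
    return reviews[:max_reviews]
```

A call might look like `fetch_all_reviews(lambda start: fetch_page("XXXXXXXXXX", "4902777323176", start))`. Passing the page fetcher in as a function keeps the loop itself independent of the network, and `max_reviews` caps the number of requests.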

Verification

Running the code above creates a file called review.csv, which you can analyze any way you like. When I opened it in a spreadsheet to check, my impression was that, surprisingly, more reviews talk about delivery than about the product itself. It would be interesting to compare high-rated and low-rated reviews.

Now I can finally start studying text mining...
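
As a starting point for comparing high-rated and low-rated reviews, the rows of review.csv could be split along these lines. This is a rough sketch, not a fixed method: the cutoff of 4 and the helper names `load_reviews` and `split_by_rating` are assumptions, and it relies only on the `rate` and `description` columns written above.

```python
import csv


def load_reviews(path="review.csv"):
    """Read the CSV written earlier into a list of dicts (one per review)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def split_by_rating(rows, high_from=4.0):
    """Split review texts into (high, low) lists by their numeric rating."""
    high, low = [], []
    for row in rows:
        # Ratings are read from CSV as strings, so convert before comparing.
        target = high if float(row["rate"]) >= high_from else low
        target.append(row["description"])
    return high, low
```

With the file in place, `high, low = split_by_rating(load_reviews())` gives two lists of review texts that can then be analyzed separately.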
