I recently started studying text mining, but I struggled because I couldn't find good sample data. Some of you may have the same trouble, so in this article I will describe the trial and error I went through to prepare sample data myself.
In addition, at work I often analyze **free-text responses (impressions, requests, etc.)** from questionnaires, so my goal is to obtain data in a similar format as much as possible.
Aozora Bunko, which often appears in text mining books, is a website that collects literary works whose copyright has expired. However, since that data is not similar to questionnaire data, I will pass on it this time.
I also considered scraping sites that collect company reviews, such as OpenWork and Job Change Conference. It seemed very interesting, but I decided the drawbacks would be too big and gave up on it this time.
Impression-posting campaigns run by companies can also yield data similar to free-text questionnaire responses (for example, a Star Wars campaign). However, when I actually looked at the posts, they read very much like tweets, so I will not use them either.
If you can extract reviews for a specific product, you will probably get data similar to free-text questionnaire responses. There are various options such as Amazon, Rakuten Ichiba, and Yahoo! Shopping, and with the **Yahoo! Shopping Product Review Search API** you can specify a JAN code[^1] to get reviews! I can't think of any obvious downsides, so I will use this API this time.
[^1]: For now, you can think of it as the barcode number printed on a product.
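As an aside, a 13-digit JAN code follows the standard EAN-13 rule, where the last digit is a check digit, so you can validate a code before sending it to the API. A minimal sketch (the function names are my own):

```python
def jan_check_digit(code12: str) -> int:
    """Compute the EAN-13 check digit for the first 12 digits of a JAN code.

    Digits in odd positions (1st, 3rd, ...) are weighted 1,
    digits in even positions are weighted 3.
    """
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(code12))
    return (10 - total % 10) % 10


def is_valid_jan(code: str) -> bool:
    """Check that a 13-digit JAN code has a correct check digit."""
    return (
        len(code) == 13
        and code.isdigit()
        and jan_check_digit(code[:12]) == int(code[-1])
    )
```

For instance, the JAN code used later in this article passes: `is_valid_jan("4902777323176")` returns `True`.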
Obtain an application ID by referring to here. If you just want to reproduce this article, you don't need to be particular about the settings (I was able to register the site URL as http://example.com/).
From here on, we run Python. Replace `appid` with the application ID you obtained earlier (it may be listed as Client ID on the admin screen). The code below only retrieves the product rating (`rate`) and the review text (`description`) for now, but you can get more information if you like (see the [official documentation](https://developer.yahoo.co.jp/webapi/shopping/shopping/v1/reviewsearch.html) for details). You can only get up to 50 reviews per request; if you need more than that, shift `start` by 50 and repeat.
```python
import requests
import json
import pandas as pd

url = "https://shopping.yahooapis.jp/ShoppingWebService/V1/json/reviewSearch"
payload = {
    "appid": "XXXXXXXXXX",
    "jan": "4902777323176",  # SAVAS protein
    "results": 50,  # default: 10, max: 50
    # "start": 1,
}

res = json.loads(requests.get(url, params=payload).text)["ResultSet"]["Result"]
rate = [x["Ratings"]["Rate"] for x in res if x["Description"] != ""]  # rating
description = [x["Description"] for x in res if x["Description"] != ""]  # review text

df = pd.DataFrame({
    "rate": rate,
    "description": description,
})
df.to_csv("review.csv", header=True, index=False)
```
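If you need more than 50 reviews, you shift `start` by 50 and repeat, as mentioned above. Here is a sketch of that loop; the HTTP call is factored out into a `fetch_page` callable (a hypothetical helper of my own, not part of the API) so the paging logic stands alone:

```python
from typing import Callable, Dict, List

PAGE_SIZE = 50  # the API returns at most 50 reviews per request


def fetch_all_reviews(fetch_page: Callable[[int], List[Dict]],
                      max_pages: int = 20) -> List[Dict]:
    """Collect reviews page by page, shifting the start offset by PAGE_SIZE.

    `fetch_page(start)` should return the list of review dicts for that
    1-based offset (e.g. by wrapping the requests.get call above).
    Stops when a page comes back shorter than PAGE_SIZE.
    """
    reviews: List[Dict] = []
    for page in range(max_pages):
        batch = fetch_page(1 + page * PAGE_SIZE)  # start is 1-based
        reviews.extend(batch)
        if len(batch) < PAGE_SIZE:
            break
    return reviews
```

With the real API, `fetch_page` would set `payload["start"]` to the given offset, call `requests.get(url, params=payload)`, and return `["ResultSet"]["Result"]` from the parsed JSON.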
Running the code above creates a file called `review.csv`, which you can then analyze however you like. When I open it in a spreadsheet to check, it looks like this. Surprisingly, my impression is that there are more reviews about delivery than about the product itself. It would be interesting to compare high-rated and low-rated reviews.
Now I can finally start studying text mining...