I don't have the expertise to build an analyzer from scratch myself, so this time I used the COTOHA API released by NTT Communications.
This time I wanted to use a chat log as the raw material. In Japan, LINE is the mainstream chat app, and LINE has a chat log export function, so I analyzed a chat log exported from LINE. Just to be safe, I obtained the other person's consent in advance.
We don't usually talk about anything that touches on each other's profiles, just bland everyday stuff, but the export was still a fairly large file. Let's start by formatting it.
Although it is not the real thing, the LINE chat log is structured like this.
Original file sample
2019/12/22 Sun
17:00 bowtin [Sticker]
17:01 hogehogekun [Sticker]
17:02 hogehogekun Let's eat ramen if you have free time today
2019/12/23 Mon
05:00 bowtin I'm sorry I slept
05:00 bowtin [Sticker]
08:35 hogehogekun do not forgive
:
:
First, I removed the following information: the date header lines, the timestamp and sender name at the start of each message, and the sticker-only messages. The result looked like this.
File after formatting
If you're free today, let's eat ramen
unforgivable
Since each chat message is on its own line, it is relatively easy to scan visually. The formatted file came to about 20,500 lines.
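The formatting step can be sketched roughly like this. This is a minimal sketch assuming the log layout shown in the sample above; the exact LINE export format may differ slightly, so the regexes are illustrative:

```python
import re

raw = """2019/12/22 Sun
17:00 bowtin [Sticker]
17:02 hogehogekun Let's eat ramen if you have free time today
2019/12/23 Mon
08:35 hogehogekun do not forgive"""

cleaned = []
for line in raw.splitlines():
    # skip date header lines like "2019/12/22 Sun"
    if re.match(r'^\d{4}/\d{2}/\d{2}', line):
        continue
    # strip the leading "HH:MM username " part of a message line
    m = re.match(r'^\d{2}:\d{2}\s+\S+\s+(.*)$', line)
    if not m:
        continue
    body = m.group(1)
    # drop sticker-only messages
    if body == '[Sticker]':
        continue
    cleaned.append(body)

print(cleaned)
```

Running this on the sample keeps only the two actual message bodies.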
That is a fairly large chat log. When I passed it to the API as-is, an error came back, so I split it into files of about 500 lines each. (I should have used glob ...)
filesplitter.py
with open(file=r'\path\to\file\sample_chatlog.txt', mode='r', encoding='utf-8') as old_file:
    lines = old_file.readlines()

for i in range(0, 21000, 500):
    # write up to 500 lines per output file
    with open(file=r'\path\to\file\splitted_file' + str(i) + '.txt', mode='a+', encoding='utf-8') as new_file:
        for line in lines[i:i + 500]:
            new_file.write(line)
I'm sure there is a better way to write this, but the goal was just to split the file, so this will do for now.
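As the aside above suggests, glob would let the later steps pick up the split files without hard-coding the numeric range. A quick sketch (using a temporary directory with dummy files just to demonstrate):

```python
import glob
import os
import tempfile

# create a few dummy split files in a temp dir to demonstrate
tmp = tempfile.mkdtemp()
for i in (0, 500, 1000):
    open(os.path.join(tmp, 'splitted_file%d.txt' % i), 'w').close()

# glob gathers the files matching the pattern without hard-coding the range
paths = sorted(glob.glob(os.path.join(tmp, 'splitted_file*.txt')))
print([os.path.basename(p) for p in paths])
```

Note that a plain lexicographic sort orders `splitted_file1000.txt` before `splitted_file500.txt`; if the order matters, sort on the extracted number instead.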
COTOHA offers a variety of APIs, but this time I used "User attribute estimation". This API was still in beta as of February 19, 2020.
Now, let's pass all the contents of the first file to the API.
Estimated result of the first file
{"civilstatus": "Unmarried", "earnings": "1M-3M", "gender": "Female", "hobby": ["INTERNET", "MUSIC", "PAINT", "TRAVEL", "TVGAME"], "moving": ["BUS", "WALKING"], "occupation": "College student"},
It's amazing. About 80% of it is accurate.
I will continue.
{"civilstatus": "Unmarried", "earnings": "-1M", "gender": "Female", "hobby": ["COOKING", "FORTUNE", "GOURMET", "INTERNET", "MUSIC", "TRAVEL"], "location": "Kanto", "moving": ["RAILWAY", "WALKING"]},
This time the result was slightly different. What is "earnings": "-1M"? Can annual income be negative?? Postscript: a commenter pointed out that it is probably meant to be read as "0-1M" rather than a negative "-1M". That is probably right! In other words, it means "less than 1M".
Also, this time there was information about the area. The information that can be extracted seems to differ slightly depending on the original data.
After that, I simply passed all ~40 files to the API one by one. Since each call just returns a response like the one above, I stored all of the returned results in a single file.
Here is the code I actually used.
main.py
import json
from time import sleep

import requests

# Basic information for requests to the API
BASE_URL = 'https://api.ce-cotoha.com/hogehoge/'
CLIENT_ID = 'YOUR ID'
CLIENT_SECRET = 'YOUR SECRET'
TOKEN_SERVER_URL = 'https://api.ce-cotoha.com/hogehoge/'


# A function that acquires an API access token
# (I believe the access token is invalidated after a fixed interval... sorry if that's wrong)
def authorization():
    payload = {
        'grantType': 'client_credentials',
        'clientId': CLIENT_ID,
        'clientSecret': CLIENT_SECRET
    }
    headers = {
        'content-type': 'application/json'
    }
    response = requests.post(TOKEN_SERVER_URL, data=json.dumps(payload), headers=headers)
    auth_info = response.json()
    return auth_info['access_token']


# A function that makes a request to the API (the argument is a list of strings)
def make_request(original_string_list):
    headers = {
        'Content-Type': 'application/json',
        'charset': 'UTF-8',
        'Authorization': 'Bearer ' + authorization()
    }
    payload = {
        'document': original_string_list,
        'type': 'kuzure'  # there is a dedicated mode for broken sentences such as chat logs
    }
    response = requests.post(BASE_URL, data=json.dumps(payload), headers=headers)
    jsonified_response = response.json()
    return jsonified_response['result']


if __name__ == '__main__':
    # List the file names of the ~40 split files (this time the names follow the pattern "splitted_file" + number)
    file_list = ['splitted_file' + str(i) + '.txt' for i in range(0, 21000, 500)]
    # Take one file name at a time from the list and read its contents
    for a_file in file_list:
        with open(file=r'path\to\file' + '\\' + a_file, mode='r', encoding='utf-8') as file:
            lines = file.readlines()
        # Throw the contents at the COTOHA API as-is and append the result to one file
        with open(file=r'path\to\file\result.txt', mode='a+', encoding='utf-8') as file:
            file.write(json.dumps(make_request(lines)))
        sleep(1)  # wait 1 second so as not to flood the API with requests
Here is an excerpt of the results. Even though it is the same person, there is some variation from file to file.
User attribute extraction result (partial excerpt).py
[
{"age": "20-29-year-old", "civilstatus": "Unmarried", "gender": "Female", "hobby": ["COOKING", "FORTUNE", "INTERNET", "TVCOMMEDY"], "moving": ["OTHER", "WALKING"]},
{"age": "20-29-year-old", "civilstatus": "Unmarried", "gender": "Female", "hobby": ["GOURMET", "INTERNET", "SMARTPHONE_GAME", "PAINT", "TVGAME"], "moving": ["CYCLING", "OTHER", "RAILWAY", "WALKING"]},
{"civilstatus": "Unmarried", "earnings": "1M-3M", "gender": "Female", "hobby": ["COOKING", "GOURMET", "INTERNET", "SHOPPING", "TRAVEL"], "moving": ["CYCLING"]},
{"age": "30-39 years old", "civilstatus": "Unmarried", "earnings": "-1M", "gender": "Female", "hobby": ["ANIMAL", "CAMERA", "COLLECTION", "COOKING", "INTERNET", "SHOPPING"]},
{"age": "20-29-year-old", "civilstatus": "Unmarried", "earnings": "-1M", "gender": "Female", "hobby": ["COOKING", "FORTUNE", "GOURMET", "PAINT"], "location": "Tokai", "moving": ["RAILWAY"]},
{"age": "20-29-year-old", "civilstatus": "Unmarried", "earnings": "-1M", "gender": "Female", "hobby": ["ANIMAL", "DRAMA", "GYM", "SMARTPHONE_GAME", "MUSIC", "TVGAME"], "moving": ["RAILWAY", "WALKING"]}
]
It is hard to get the overall picture like this, so let's aggregate. I will hard-code the above results as a list of Python dicts.
parse_result.py
from collections import Counter
import itertools

results = [
    {"age": "20-29-year-old", "civilstatus": "Unmarried", "earnings": "-1M", "gender": "Female", "hobby": ["ANIMAL", "DRAMA", "GYM", "SMARTPHONE_GAME", "MUSIC", "TVGAME"], "moving": ["RAILWAY", "WALKING"]},
    {"age": "30-39 years old", "civilstatus": "Unmarried", "earnings": "-1M", "gender": "Female", "hobby": ["ANIMAL", "CAMERA", "COLLECTION", "COOKING", "INTERNET", "SHOPPING"]}
    # The rest is omitted
]

# collections.Counter makes it easy to retrieve the mode
print(Counter([data['age'] for data in results if 'age' in data]).most_common()[0][0])
print(Counter([data['location'] for data in results if 'location' in data]).most_common()[0][0])
print(Counter([data['gender'] for data in results if 'gender' in data]).most_common()[0][0])
print(Counter([data['civilstatus'] for data in results if 'civilstatus' in data]).most_common()[0][0])
print(Counter([data['earnings'] for data in results if 'earnings' in data]).most_common()[0][0])
# 'hobby' holds lists, so flatten them into one flat list before taking the mode
print(Counter(list(itertools.chain.from_iterable([data['hobby'] for data in results if 'hobby' in data]))).most_common()[0][0])
The mode of each attribute came out like this.
Mode
age: 20-29-year-old
location: Kanto
gender: Female
civilstatus: Unmarried
earnings: 1M-3M
hobby: INTERNET
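Since the per-file results vary, it can also be worth looking at the whole frequency distribution rather than just the single mode. `Counter.most_common()` called with no argument returns every value with its count, in descending order. A small sketch on made-up sample data:

```python
from collections import Counter

# hypothetical sample of 'age' values pulled from the results
ages = ['20-29-year-old', '20-29-year-old', '30-39 years old', '20-29-year-old']

# most_common() with no argument returns the full frequency distribution,
# not just the single most frequent value
distribution = Counter(ages).most_common()
print(distribution)
```

This would also show, for example, whether the mode won by a landslide or only by one or two votes.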
I think the accuracy is quite high. At the very least, I have never talked about whether I am married in this chat, and of course no one asks the blunt question "Are you a woman?". The annual income estimate may be a little off, though.
Incidentally, I also tried this briefly on chats with other people, and it was mostly correct.
In the future, it may become possible to work out someone's real profile to some extent from their chat logs, for example on matching apps! And judging from the deviations, it might even reveal when someone who usually writes as a different persona is deliberately switching characters.
In conclusion, it turns out that people reveal more of their profile in chat than you might expect, although I think it would be hard to pin down people who are thorough about playing a character, such as VTubers and so-called "nekama" (men posing as women online).