Purpose

There was an interesting campaign running, and it made me wonder how much of their profile people unintentionally give away.

Environment and data

API for analysis

I don't have enough knowledge to do the analysis from scratch myself, so this time I used the COTOHA API released by NTT Communications.

Raw data

This time I will use a chat log as the basis. In Japan, LINE is the mainstream chat app, and it has a function for exporting chat logs, so the analysis uses a chat log exported from LINE. For the record, I obtained the other person's consent in advance.

The file was too big

We usually only talk about bland things that hardly touch on each other's profiles, but the file was still fairly large. Let's start by formatting it.

The following is not the real data, but a LINE chat log is structured like this.

`Original file sample`


2019/12/22 Sun
17:00 bowtin [Sticker]
17:01 hogehogekun [Sticker]
17:02 hogehogekun Let's eat ramen if you have free time today

2019/12/23 Mon
05:00 bowtin I'm sorry I slept
05:00 bowtin [Sticker]
08:35 hogehogekun do not forgive
   ：
   ：

First, I removed the following information:

Post date, day of the week, and time
Name of the poster
My own posts (only the other party's posts are analyzed)
System-related strings such as [Sticker] and [Photo]

As a result, the file looked like this.

`File after formatting`


Let's eat ramen if you have free time today
do not forgive

Since each line holds exactly one message, the file is relatively easy to read. The formatted file had about 20,500 lines.
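
The formatting script itself is not shown, so the following is only a minimal sketch of the steps listed above. It assumes every message line has the form `HH:MM name message` separated by whitespace, as in the sample, and that names contain no spaces. The function name `format_chat_log`, the `MY_NAME` constant, and the tuple of system strings are hypothetical names chosen for illustration.

```python
import re

# Assumptions for illustration: the log owner's display name and the system
# placeholders to drop. Neither is taken from the original script.
MY_NAME = 'bowtin'
SYSTEM_STRINGS = ('[Sticker]', '[Photo]')

# A message line is assumed to look like "17:02 hogehogekun some text",
# with any run of whitespace (including full-width spaces) as the separator.
MESSAGE_LINE = re.compile(r'^(\d{1,2}:\d{2})\s+(\S+)\s+(.*)$')


def format_chat_log(raw_lines, my_name=MY_NAME):
    """Return only the other party's message texts, one per list item."""
    formatted = []
    for raw in raw_lines:
        match = MESSAGE_LINE.match(raw.strip())
        if match is None:
            continue  # date headers, blank lines, and anything unrecognized
        _, name, text = match.groups()
        if name == my_name:
            continue  # drop my own posts
        for system_string in SYSTEM_STRINGS:
            text = text.replace(system_string, '')
        text = text.strip()
        if text:
            formatted.append(text)
    return formatted


sample = [
    '2019/12/22 Sun',
    '17:00\u3000bowtin [Sticker]',
    '17:01  hogehogekun [Sticker]',
    "17:02 hogehogekun Let's eat ramen if you have free time today",
    '',
    '2019/12/23 Mon',
    "05:00 bowtin I'm sorry I slept",
    '08:35 hogehogekun do not forgive',
]
print(format_chat_log(sample))
```

A real export may contain multi-line messages or names with spaces, which this sketch does not handle.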

Divide the file into about 500 lines

Even after formatting, the chat log was fairly large at about 20,500 lines. When I sent it to the API as it was, an error came back, so I split it into files of about 500 lines each. (I should have used glob ...)

`filesplitter.py`



CHUNK_SIZE = 500  # number of lines per split file

with open(file=r'\path\to\file\sample_chatlog.txt', mode='r', encoding='utf-8') as old_file:
    lines = old_file.readlines()

# Step through the lines 500 at a time; slicing past the end of the list is safe
for i in range(0, len(lines), CHUNK_SIZE):
    # File names follow the pattern splitted_file0.txt, splitted_file500.txt, ...
    with open(file=r'\path\to\file\splitted_file' + str(i) + '.txt', mode='w', encoding='utf-8') as new_file:
        new_file.writelines(lines[i:i + CHUNK_SIZE])

There is probably a better way to write this, but the goal for now was just to split the file, so I went with it.
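
As the remark about glob suggests, the split files could also be discovered with the standard library's `glob` module instead of rebuilding their names from a numeric range. This is a minimal sketch that assumes the `splitted_file<number>.txt` naming used above. The helper name `list_split_files` and its `directory` argument are hypothetical.

```python
import glob
import os
import re


def list_split_files(directory):
    """Return the split files in `directory`, ordered by their numeric suffix."""
    pattern = os.path.join(directory, 'splitted_file*.txt')

    def numeric_suffix(path):
        # "splitted_file1500.txt" -> 1500; names without a number sort first
        found = re.search(r'splitted_file(\d+)\.txt$', os.path.basename(path))
        return int(found.group(1)) if found else -1

    return sorted(glob.glob(pattern), key=numeric_suffix)


for path in list_split_files('.'):
    print(path)
```

Sorting by the numeric suffix keeps the files in chat order. A plain alphabetical sort would put `splitted_file1000.txt` before `splitted_file500.txt`.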

User attribute estimation using COTOHA API

The COTOHA API offers various APIs, but this time I used "User attribute estimation". This API is still in beta (as of February 19, 2020).

Now, let's pass all the contents of the first file to the API.

`Estimated result of the first file`


{"civilstatus": "Unmarried", "earnings": "1M-3M", "gender": "Female", "hobby": ["INTERNET", "MUSIC", "PAINT", "TRAVEL", "TVGAME"], "moving": ["BUS", "WALKING"], "occupation": "College student"},

It's amazing. About 80% of it is correct.

Let's continue with the next file.

{"civilstatus": "Unmarried", "earnings": "-1M", "gender": "Female", "hobby": ["COOKING", "FORTUNE", "GOURMET", "INTERNET", "MUSIC", "TRAVEL"], "location": "Kanto", "moving": ["RAILWAY", "WALKING"]},

This time the result is slightly different. What is "earnings": "-1M"? A negative annual income? Postscript: I received a comment that it should probably be read as "0-1M" rather than "minus 1M", that is, "less than 1M". That may well be true!
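
The reading suggested in that comment can be sketched by treating the label as a range with an optional lower bound. This is only an illustration of that interpretation, not something confirmed by the API documentation. The function name `parse_earnings` is hypothetical, and the bounds are kept as strings because the unit behind "M" is not stated here.

```python
def parse_earnings(label):
    """Split a label such as "1M-3M" or "-1M" into (lower, upper) bounds.

    A missing bound is returned as None, so "-1M" reads as "up to 1M"
    rather than as a negative number.
    """
    lower, separator, upper = label.partition('-')
    if not separator:
        raise ValueError('expected a "-" in the label: ' + repr(label))
    return (lower or None, upper or None)


print(parse_earnings('-1M'))    # (None, '1M')
print(parse_earnings('1M-3M'))  # ('1M', '3M')
```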

Also, this time the result included a location. The attributes that can be extracted seem to differ slightly depending on the input data.

After that, I simply passed the remaining files, about 40 in total, to the API. Since each call just returns a response like the ones above, I stored all of the returned results in one file.

Here is the code I actually used.

`main.py`


import json
from time import sleep

import requests

# Basic information for requests to the API
BASE_URL = 'https://api.ce-cotoha.com/hogehoge/'
CLIENT_ID = 'YOUR ID'
CLIENT_SECRET = 'YOUR SECRET'
TOKEN_SERVER_URL = 'https://api.ce-cotoha.com/hogehoge/'


# Function that acquires an API access token (the token seems to be invalidated at regular intervals, so a new one is requested each time; apologies if that is not the actual specification)
def authorization():
    payload = {
        'grantType': 'client_credentials',
        'clientId': CLIENT_ID,
        'clientSecret': CLIENT_SECRET
    }
    headers = {
        'content-type': 'application/json'
    }
    response = requests.post(TOKEN_SERVER_URL, data=json.dumps(payload), headers=headers)
    auth_info = response.json()

    return auth_info['access_token']


# Function that makes a request to the API (the argument is a list of strings)
def make_request(original_string_list):
    headers = {
        # The charset belongs in the Content-Type value, not in a separate header
        'Content-Type': 'application/json;charset=UTF-8',
        'Authorization': 'Bearer ' + authorization()
    }

    payload = {
        'document': original_string_list,
        'type': 'kuzure'  # There seems to be a dedicated mode for broken sentences such as chat logs
    }

    response = requests.post(BASE_URL, data=json.dumps(payload), headers=headers)

    jsonified_response = response.json()
    return jsonified_response['result']


if __name__ == '__main__':
    # List the file names of the roughly 40 split files (the names follow a regular "file name + number" pattern)
    file_list = ['splitted_file' + str(i) + '.txt' for i in range(0, 21000, 500)]

    # Take one file name at a time from the list and read its contents
    for a_file in file_list:
        try:
            with open(file=(r'path\to\file' + '\\' + a_file), mode='r', encoding='utf-8') as file:
                lines = file.readlines()
        except FileNotFoundError:
            continue  # The range is generous, so the last numbers may have no file

        # Send the contents to the COTOHA API as they are and append the result to a file, one JSON object per line
        with open(file=r'path\to\file\result.txt', mode='a+', encoding='utf-8') as file:
            file.write(json.dumps(make_request(lines)) + '\n')

        sleep(1)  # Sending too many requests in a short time would cause trouble, so wait 1 second
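
Because the script appends every response to a single result.txt, the file ends up holding a run of JSON values rather than one JSON document. The following is a minimal sketch of reading such a file back with the standard library's `json.JSONDecoder.raw_decode`. It assumes the values are separated by nothing more than optional whitespace. The helper name `read_concatenated_json` and the sample text are hypothetical.

```python
import json


def read_concatenated_json(text):
    """Decode JSON values written back to back, e.g. '{"a": 1}{"a": 2}'."""
    decoder = json.JSONDecoder()
    values = []
    position = 0
    while position < len(text):
        # raw_decode does not skip leading whitespace, so skip it by hand
        if text[position].isspace():
            position += 1
            continue
        value, position = decoder.raw_decode(text, position)
        values.append(value)
    return values


sample_text = '{"gender": "Female"}{"gender": "Female", "location": "Kanto"}\n'
print(read_concatenated_json(sample_text))
```

The returned list of dicts can be tallied directly, without copying the results into source code by hand.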

Result

Here is an excerpt from the list of results. The estimates, including the age, vary somewhat from file to file.

`User attribute extraction result (partial excerpt).py`


[
{"age": "20-29-year-old", "civilstatus": "Unmarried", "gender": "Female", "hobby": ["COOKING", "FORTUNE", "INTERNET", "TVCOMMEDY"], "moving": ["OTHER", "WALKING"]},
{"age": "20-29-year-old", "civilstatus": "Unmarried", "gender": "Female", "hobby": ["GOURMET", "INTERNET", "SMARTPHONE_GAME", "PAINT", "TVGAME"], "moving": ["CYCLING", "OTHER", "RAILWAY", "WALKING"]},
{"civilstatus": "Unmarried", "earnings": "1M-3M", "gender": "Female", "hobby": ["COOKING", "GOURMET", "INTERNET", "SHOPPING", "TRAVEL"], "moving": ["CYCLING"]},
{"age": "30-39 years old", "civilstatus": "Unmarried", "earnings": "-1M", "gender": "Female", "hobby": ["ANIMAL", "CAMERA", "COLLECTION", "COOKING", "INTERNET", "SHOPPING"]},
{"age": "20-29-year-old", "civilstatus": "Unmarried", "earnings": "-1M", "gender": "Female", "hobby": ["COOKING", "FORTUNE", "GOURMET", "PAINT"], "location": "Tokai", "moving": ["RAILWAY"]},
{"age": "20-29-year-old", "civilstatus": "Unmarried", "earnings": "-1M", "gender": "Female", "hobby": ["ANIMAL", "DRAMA", "GYM", "SMARTPHONE_GAME", "MUSIC", "TVGAME"], "moving": ["RAILWAY", "WALKING"]}
]

It is hard to see the overall picture from this, so let's tally the results. I will hard-code the results above as a list of Python dicts.

`parse_result.py`


results = [ 
  {"age": "20-29-year-old", "civilstatus": "Unmarried", "earnings": "-1M", "gender": "Female", "hobby": ["ANIMAL", "DRAMA", "GYM", "SMARTPHONE_GAME", "MUSIC", "TVGAME"], "moving": ["RAILWAY", "WALKING"]},
  {"age": "30-39 years old", "civilstatus": "Unmarried", "earnings": "-1M", "gender": "Female", "hobby": ["ANIMAL", "CAMERA", "COLLECTION", "COOKING", "INTERNET", "SHOPPING"]}
  #The following is omitted
]


from collections import Counter
import itertools

# collections.Counter makes it easy to retrieve the mode (results that lack the key are skipped)
print(Counter([data['age'] for data in results if 'age' in data]).most_common()[0][0])
print(Counter([data['location'] for data in results if 'location' in data]).most_common()[0][0])
print(Counter([data['gender'] for data in results if 'gender' in data]).most_common()[0][0])
print(Counter([data['civilstatus'] for data in results if 'civilstatus' in data]).most_common()[0][0])
print(Counter([data['earnings'] for data in results if 'earnings' in data]).most_common()[0][0])

# hobby holds a list per result, so flatten everything into one list before taking the mode
print(Counter(list(itertools.chain.from_iterable([data['hobby'] for data in results if 'hobby' in data]))).most_common()[0][0])

The modes come out as follows.

`Mode`


20-29-year-old
Kanto
Female
Unmarried
1M-3M
INTERNET
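
The repeated Counter lines could be folded into one helper that handles both single-valued keys and list-valued keys such as hobby. This is a sketch that assumes a list of result dicts shaped like the hard-coded one. The function name `most_common_values` and the two-item `sample_results` list are hypothetical stand-ins, not the real data.

```python
from collections import Counter


def most_common_values(results, key, top=1):
    """Return the `top` most frequent values of `key` across the result dicts.

    Dicts that lack the key are skipped, and list values are counted item
    by item.
    """
    counter = Counter()
    for result in results:
        if key not in result:
            continue
        value = result[key]
        if isinstance(value, list):
            counter.update(value)
        else:
            counter[value] += 1
    return [value for value, _ in counter.most_common(top)]


sample_results = [
    {"gender": "Female", "hobby": ["COOKING", "INTERNET"]},
    {"gender": "Female", "hobby": ["INTERNET", "MUSIC"], "location": "Kanto"},
]
print(most_common_values(sample_results, "gender"))    # ['Female']
print(most_common_values(sample_results, "hobby"))     # ['INTERNET']
print(most_common_values(sample_results, "location"))  # ['Kanto']
print(most_common_values(sample_results, "age"))       # []
```

Returning a list means a key that never appears yields an empty list instead of raising an IndexError, and `top` can be raised to see the runners-up.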

Consideration

I think the accuracy is quite high. At the very least, I don't recall ever discussing whether this person is married in our chats, and of course I never asked a blunt question like "Are you a woman?" We may have touched on annual income a little.

By the way, I also tried this on chats with a few other people, and the results were mostly correct.

In the future, it may become possible to work out a person's true profile to some extent from their chat log in a matching app and the like! Looking at how much the estimates vary, it might even reveal when someone habitually switches between different personas.

In conclusion, people turned out to give away more of their profile in chat than expected. I suspect, though, that the profile would be hard to infer for people who maintain a thoroughly constructed character, such as VTubers and so-called nekama (men who pose as women online).

Is it possible to extract the person's profile information from the chat log?
