| | What I did | Main events |
|---|---|---|
| Episode 1 | Automatic right swipes | |
| Episode 2 | Automatic message sending | Matched with a woman |
| Episode 3 | Turning it into a library | Exchanged LINE with a woman I matched |
| Episode 3.5 | Re-acquisition of the access token | Tokens could no longer be obtained with the previous code |
| Episode 4 | Data collection | LINE replies stopped coming |
| Episode 5 | Data analysis: profile text edition | Someone I became friends with recommended an "information product" to me |
| Episode 6 | Data analysis: image edition | A female acquaintance from real life called me late at night (?) |
The code can be viewed on [GitHub].
I was busy preparing for a conference, and before I knew it, more than two months had passed since the last article. The crawler has kept running the whole time, though, so a lot of data has accumulated since last time. As usual, I still don't have a girlfriend.
A lot of data has been collected: I swiped on 10,632 women, and 72 of them matched. Fewer matches than I expected. Last time I saved the table data in a spreadsheet and the image data in Google Drive, so I started by downloading them. A spreadsheet can be downloaded in several file formats, but with csv or tsv the line breaks in the profile texts and the commas that foreigners tend to write in their bios break the export and are a pain to deal with, so I saved it in .xlsx format instead. There were also about 25,000 profile images, so downloading them took quite a while. The analysis is done in a Jupyter notebook.
Please note that this is an analysis of data collected against my own profile, so if you run the same code the results may differ.[^1] Keep that in mind.
First, let's look at the data of the matched people.
analytics.py
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# Load the spreadsheet downloaded as .xlsx and index the rows by user id
filePath = "data/tinder.xlsx"
df = pd.read_excel(filePath)
df.set_index("id", inplace=True)

# People who matched
match = df[df["match"]==1]
Anyone you recognize in there??
Next, the data of the people who did not match.
analytics.py
unmatch = df[df["match"]==0]
At a glance, it feels like the people who matched write their profile text more carefully. Let's check. It is hard to define what a "well-written profile" is, so for now let's simply compare the number of characters in the profile text.
analytics.py
%matplotlib inline
sns.distplot(unmatch["bio"].apply(lambda w:len(str(w))), color="b", bins=30)
sns.distplot(match["bio"].apply(lambda w:len(str(w))), color="r", bins=30)
The result is as follows: the matched group in red, the unmatched group in blue.
As expected, red has fewer profiles near zero characters than blue. In fact, quite a few of the accounts that did not match have not written a single character in their profile. Accounts with a blank profile rarely match even if you swipe right, so it seems better not to swipe on them.
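As a quick check of that claim, here is a small snippet I added (not in the original article; it reuses the match/unmatch frames defined above) that computes the fraction of blank or missing bios in each group:

# Fraction of blank / missing profile texts in each group.
# A bio read from the spreadsheet may come back as NaN, so treat NaN
# (which becomes the string "nan") as empty too.
def blank_ratio(frame):
    bios = frame["bio"].astype(str).str.strip()
    return ((bios == "") | (bios == "nan")).mean()

print("blank bios (match):   {:.1%}".format(blank_ratio(match)))
print("blank bios (unmatch): {:.1%}".format(blank_ratio(unmatch)))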
Next, I'd like to look at the words contained in the profile texts. Morphological analysis is performed on the profile text using the morphological analysis engine MeCab [1] together with the extended dictionary mecab-ipadic-NEologd [2].
MeCab can be installed with `pip install mecab-python3`. For installing mecab-ipadic-NEologd, the official README [3] is very well organized, so please refer to it. There are many configuration options, but if that is too much trouble, just run:
$git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git ~/neologd
$echo yes | ~/neologd/bin/install-mecab-ipadic-neologd -n -a
and the installation is done.
Call MeCab from Python to split the profile text into words.
Calling MeCab as-is uses the standard dictionary, so specify NEologd with the `-d` option. The location of the dictionary can be obtained with `` echo `mecab-config --dicdir`"/mecab-ipadic-neologd" ``.
mecab.py
import subprocess
import MeCab

# Locate the NEologd dictionary and pass it to MeCab with the -d option
cmd = 'echo `mecab-config --dicdir`"/mecab-ipadic-neologd"'
path = (subprocess.Popen(cmd, stdout=subprocess.PIPE,
                         shell=True).communicate()[0]).decode('utf-8')
m = MeCab.Tagger("-d {0}".format(path))
print(m.parse("彼女はペンパイナッポーアッポーペンと恋ダンスを踊った。"))
#>>
# 彼女                           noun, pronoun      (she)
# は                             particle           (topic marker)
# ペンパイナッポーアッポーペン   noun, proper noun  (Pen-Pineapple-Apple-Pen)
# と                             particle           (with)
# 恋ダンス                       noun, proper noun  (Koi Dance)
# を                             particle           (object marker)
# 踊っ                           verb               (danced)
# た                             auxiliary verb     (past tense)
# 。                             symbol             (period)
# EOS
Use this to extract the words contained in the profile texts of the matched and unmatched groups.
analytics.py
def getWord(df):
    retval = []
    for bio in df.bio:
        parse = m.parse(str(bio)).strip().split("\n")
        for p in parse:
            if "\t" not in p:
                continue
            word, desc = p.split("\t")
            # Keep content words; the goal is to drop particles and auxiliary verbs
            if desc.split(",")[0] in ("名詞", "動詞", "形容詞", "形容動詞",
                                      "連体詞", "副詞", "接続詞", "感動詞", "記号"):
                retval.append(word)
    return retval

bio_match = getWord(match)
bio_unmatch = getWord(unmatch)
The extracted words are displayed in order of frequency of occurrence. First, the matched group.
analytics.py
df_bio_match = pd.DataFrame.from_dict(
Counter(bio_match), orient="index").reset_index().rename(columns={"index":"word",0:"count"})
sns.barplot(data=df_bio_match.sort_values(
"count", ascending=False)[:20], x="word", y="count")
plt.xticks(rotation="vertical")
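Some of the most frequent "words" are hard to read in the plot. As a small side check of my own (not part of the original analysis), the snippet below prints the Unicode name of any single-character, non-ASCII token among the top words, which makes things like thin spaces or ∇ easy to identify:

import unicodedata

# Name any single, non-ASCII characters among the most frequent tokens
# so that the unreadable "tofu" boxes in the plot become identifiable
top_words = df_bio_match.sort_values("count", ascending=False)["word"].head(30)
for w in top_words:
    if len(w) == 1 and ord(w) > 127:
        print(repr(w), unicodedata.name(w, "UNKNOWN"))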
The "tofu" boxes showing up in the plot are some kind of blank character, maybe a thin space. I also wondered who was using ∇ (nabla); it turned out to come from emoticons like (・∇・). Next, the people who did not match.
analytics.py
df_bio_unmatch = pd.DataFrame.from_dict(
Counter(bio_unmatch), orient="index").reset_index().rename(columns={"index":"word",0:"count"})
sns.barplot(data=df_bio_unmatch.sort_values(
"count", ascending=False)[:20], x="word", y="count")
plt.xticks(rotation="vertical")
There may or may not be a difference in tendency... For example, the matched group tends not to use punctuation marks in their profiles. Also, while "like" written in kanji (好き) appears in both groups, not a single person who wrote it in hiragana (すき) matched. ~~Is that a red flag?~~ Honestly, with so few matches this is probably within the margin of error, but it may be worth remembering.
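That last observation is easy to re-check with a couple of substring counts. A rough sketch of mine, reusing the match/unmatch frames from above (substring matching only, so すき also picks up occurrences inside unrelated words):

# How many bios mention "like/love" written in kanji (好き) versus hiragana (すき)
for name, frame in (("match", match), ("unmatch", unmatch)):
    bios = frame["bio"].astype(str)
    kanji = bios.str.contains("好き").sum()
    hira = bios.str.contains("すき").sum()
    print("{}: 好き in {} bios, すき in {} bios (n={})".format(name, kanji, hira, len(frame)))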
Finally, let's vectorize the profile texts using Doc2Vec. Word2Vec, a neural network that turns words into vectors, became a big topic in the NLP area a few years ago; Doc2Vec is an algorithm that applies the same idea to sentences instead of individual words. The explanation of Word2Vec in [4] and the explanations of Doc2Vec in [5][6] were helpful. The implementation uses a library called gensim [7]; install it with `pip install gensim`. For the concrete code I referred to [8].
analytics.py
from sklearn.model_selection import train_test_split
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Split the data into training data and test data
df_train, df_test = train_test_split(df, random_state=8888)

# Split profile texts into words using MeCab.
# The -Owakati option makes MeCab output space-separated words without part-of-speech info.
m_wakati = MeCab.Tagger("-d {0} -Owakati".format(path))
bios = []
for bio in df_train.bio:
    bios.append(m_wakati.parse(str(bio)).strip())

# Convert the data into a format gensim can handle
trainings = [TaggedDocument(words=data.split(), tags=[i]) for i, data in enumerate(bios)]

# Train Doc2Vec
doc2vec = Doc2Vec(documents=trainings, dm=1, vector_size=300, window=4, min_count=3, workers=4)

# Vectors for the training data
X_train = np.array([doc2vec.docvecs[i] for i in range(df_train.shape[0])])
# Correct labels for the training data
y_train = df_train["match"]

# Vectors and correct labels for the test data (tokenized with the same wakati tagger)
X_test = np.array([doc2vec.infer_vector(m_wakati.parse(str(bio)).strip().split(" ")) for bio in df_test.bio])
y_test = df_test["match"]
Let's visualize the vectorized texts using PCA. First, the training data.
analytics.py
from sklearn.decomposition import PCA
pca = PCA()
X_reduced = pca.fit_transform(X_train)
plt.scatter(X_reduced[y_train==0][:,0], X_reduced[y_train==0][:,1], c="b", label="No Match")
plt.scatter(X_reduced[y_train==1][:,0], X_reduced[y_train==1][:,1], c="r", label="Match")
plt.legend()
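As a numeric complement to the scatter plot (a small addition of mine, not in the original article), the spread of the second principal component can also be compared per class directly:

# Compare the spread of the 2nd principal component between the two classes
pc2 = X_reduced[:, 1]
for label, name in ((0, "No Match"), (1, "Match")):
    vals = pc2[np.asarray(y_train) == label]
    print("{}: mean={:.3f}, std={:.3f}".format(name, vals.mean(), vals.std()))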
Among the unmatched, a fair number of people have a large second principal component, while for the matched group the second principal component generally sits around 0. Let's also look at the test data.
analytics.py
X_test_reduced = pca.transform(X_test)
plt.scatter(X_test_reduced[y_test==0][:,0], X_test_reduced[y_test==0][:,1], c="b", label="No Match")
plt.scatter(X_test_reduced[y_test==1][:,0], X_test_reduced[y_test==1][:,1], c="r", label="Match")
plt.legend()
Here too, the matched people's second principal component clusters near 0.
Now that the texts are vectorized, let's classify them with machine learning. I'll classify the profile vectors with a support vector machine. The data here is extremely imbalanced, so a classifier trained naively will draw its decision boundary so that every profile is judged "no match", which is useless. Going back to what I actually wanted to do: the goal was never to improve machine learning accuracy, it was to get a girlfriend. Looking at each cell of the confusion matrix:
| | Description | Remarks |
|---|---|---|
| TP | A person who actually matches is judged to match | This is what I'm looking for |
| TN | A person who does not match is judged not to match | Saves pointless right swipes |
| FP | A person who does not match is judged to match | One wasted right swipe |
| FN | A person who actually matches is judged not to match | I miss my soulmate |
Obviously FN is the worst and must be avoided at all costs. FP is undesirable, but a few of them are acceptable. So what this task really demands is that recall be as high as possible; precision and the F-measure may be low. Of course, predicting "match" for everyone gives perfect recall at the price of destroying precision and the F-measure [^2], and avoiding that is exactly why machine learning was brought in, so lowering recall relative to that baseline would be a step backwards. The strategy this time is therefore to estimate the probability of matching with a regressor and set a fairly low threshold, cutting out only the profiles that obviously have no chance of matching; anything merely suspicious still gets a right swipe. AUC is used as the evaluation metric.
analytics.py
from sklearn.svm import SVR
from sklearn.metrics import roc_auc_score

# Use support vector regression scores as a rough "probability of matching"
# and evaluate how well they rank matches above non-matches with ROC AUC
model = SVR(C=100.0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(roc_auc_score(y_test, y_pred))
#>> 0.6196
An AUC over 0.6! Not a bad result, is it? The concrete threshold will be set after the image analysis is done; a rough sketch of how such a cutoff could be chosen follows below. This article has gotten long, so that's it for today. Look forward to the next installment, the profile image edition.
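The sketch below is only an illustration of the "low threshold, high recall" idea described above, not code from this series: using scikit-learn's precision_recall_curve, one could pick the strictest threshold that still catches every actual match in the test set.

from sklearn.metrics import precision_recall_curve

# Precision/recall for every candidate threshold over the SVR scores
precision, recall, thresholds = precision_recall_curve(y_test, y_pred)

# thresholds are sorted in increasing order and recall only goes down as the
# threshold rises, so the last index with recall == 1.0 is the strictest
# cutoff that still catches every actual match in the test set
idx = np.where(recall[:-1] >= 1.0)[0]
if idx.size:
    i = idx[-1]
    print("threshold: {:.4f}  (precision {:.3f}, recall {:.3f})".format(
        thresholds[i], precision[i], recall[i]))

Anything scoring above that cutoff gets a right swipe; only the clearly hopeless profiles below it are skipped.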
Episode 6 is [here].
[1] https://taku910.github.io/mecab/
[2] https://github.com/neologd/mecab-ipadic-neologd
[3] https://github.com/neologd/mecab-ipadic-neologd/blob/master/README.ja.md
[4] Yasuki Saito, Deep Learning from Scratch ❷: Natural Language Processing
[5] https://kitayamalab.wordpress.com/2016/12/10/doc2vecparagraph-vector-algorithm/
[6] https://deepage.net/machine_learning/2017/01/08/doc2vec.html
[7] https://radimrehurek.com/gensim/index.html
[8] https://qiita.com/asian373asian/items/1be1bec7f2297b8326cf
[^1]: I'd like to verify what difference would appear if a handsome guy ran the same experiment as me. Then again, maybe I don't want to see the result.
[^2]: And that is exactly what the swipe strategy used so far, the all-right swipe, achieved.