Hello everyone.
Do you use dating apps? Things have been going well with someone I recently matched with on one.
By the way, the app I was using lets you browse the profiles of other popular members. (It seems to show people who have received 100 or more likes.)
Seeing that made me a little depressed.
"I didn't even get 100 likes..." **"I want to become a 100+ likes man too."**
That's what I thought, strongly.
At the same time, I wondered: how do you become a "100+ likes man"? With that in mind, I analyzed the data.
I steadily entered the other members' data by hand (making full use of Google Docs' transcription feature) and collected about 60 records.
The members shown were people close to my own age of 32, so treat all of the data below as men around 30 years old.
I analyzed this steadily collected data using Python libraries.
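For reference, loading the transcribed records might look something like this (the file name members.csv and the exact column names here are just placeholders, not anything from the app):

```python
import pandas as pd

# ~60 manually transcribed member records, one row per member,
# with columns such as 'Number of likes', 'annual income',
# 'Educational background', 'height', 'Body type', ...
data = pd.read_csv('members.csv')

print(data.shape)   # roughly (60, number_of_items)
print(data.head())
```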
For the features, I used the following profile items as inputs.
- Number of likes
- Face (binary: whether or not the photo shows a face)
- (plus the other profile items that appear in the prediction section later: educational background, height, body type, annual income, and so on)
**Let's look at the relationship between the number of likes and the features you're probably curious about.**
Annual income, right out of the gate! It's always annual income! Damn it! (I'm studying data science because I want a higher annual income. 10 million yen, please.)
So let's draw a scatter plot.
import matplotlib.pyplot as plt

# data is the DataFrame of collected records
plt.scatter(data['annual income'], data['Number of likes'], alpha=0.3)
(Vertical axis: number of likes. Horizontal axis: annual income.)
**[Discussion]** I feel like I'm being told: **"Anyone who doesn't make 5 million yen isn't a man. A monthly take-home of 140,000 yen is out of the question."**
Perhaps surprisingly, there is not much correlation between annual income and the number of likes (a higher income does not mean more likes).
When I actually computed the correlation coefficient...
import pandas as pd

pd.DataFrame({"x": data['annual income'], "y": data['Number of likes']}).corr()
Correlation matrix of annual income (x) and number of likes (y):
          x        y
x   1.00000 -0.06363
y  -0.06363  1.00000
It can be said that there is almost no correlation.
People who get a lot of likes probably have something going for them other than annual income. (*That said, this only covers men earning 5 million yen or more.)
Which other features are involved? Let's take a look.
The answer choices for educational background were: Junior college / Vocational school / Technical college graduate | High school graduate | University graduate | Graduate school graduate | Other
These Japanese strings are awkward to work with, so I label-encoded them:
data['Educational background'] = data['Educational background'].replace({'Junior college/Vocational school/Technical college graduate': 0, 'High school graduate': 1, 'University graduate': 2, 'Graduate school graduate': 3, 'Other': 4})
Let's draw a scatter plot in this form.
plt.scatter(data['Educational background'], data['Number of likes'], alpha=0.3)
The result... (Sorry, I didn't adjust the axis scale.)
(x-axis: 0 = Junior college / Vocational school / Technical college graduate, 1 = High school graduate, 2 = University graduate, 3 = Graduate school graduate, 4 = Other.)
**Sure enough, most are university graduates or above...**
Here, instead of the correlation coefficient, we use the correlation ratio, since this is a relationship between a quantitative variable and a qualitative variable. Educational background can't be treated as a quantity (we don't know how much "more" a graduate-school graduate is than a university graduate, so it's a qualitative variable), so let's check whether the number of likes is biased across the answer choices.
A function I picked up somewhere:
import numpy as np

def corr_ratio(x, y):
    # total sum of squared deviations of x
    variation = ((x - x.mean()) ** 2).sum()
    # sum of squared deviations of x within each category of y
    inter_class = sum([((x[y == i] - x[y == i].mean()) ** 2).sum() for i in np.unique(y)])
    return inter_class / variation
# Correlation ratio: number of likes vs. educational background
corr_ratio(data.loc[:, ["Number of likes"]].values, data.loc[:, ['Educational background']].values)
result
# 0.8820459777290447
There seems to be some correlation.
**[Discussion]** I feel like I'm being told: **"At the very least, graduate from university."**
Next up is height. I'm basically asking, "Will you forgive me even if I'm not tall?", so please don't turn out like annual income did...
Let's take a look.
# Plot, excluding rows where height is NaN
plt.scatter(data.loc[data["height"].notnull()]["height"], data.loc[data["height"].notnull()]['Number of likes'], alpha=0.3)
...?
This scatter plot makes the height distribution hard to see, so let's draw a histogram instead.
plt.hist(data["height"].astype(np.float32))
(Heights are rounded to the nearest 5 cm.)
By the way, the average height of Japanese men is about 170 cm... **Merciless.**
Let's get the correlation coefficient here as well.
pd.DataFrame({"x":data['height'].astype(np.float32), "y":data['Number of likes']}).corr()
Correlation matrix of height (x) and number of likes (y):
          x         y
x  1.000000  0.073241
y  0.073241  1.000000
There is almost no correlation here either.
**[Discussion]** I feel like I'm being told: **"I'm not asking for 180 cm, but I'd like 175 cm."**
I'm 171 cm ... ~~ ○ Really ~~
The body type options were replaced as follows: 'Slim': 0, 'Slightly thin': 1, 'Normal': 2, 'Muscular': 3, 'Slightly chubby': 4, 'Chubby': 5
Let's plot after making this replacement.
# Plot, excluding rows where body type is NaN
plt.scatter(data.loc[data["Body type"].notnull()]["Body type"], data.loc[data["Body type"].notnull()]['Number of likes'], alpha=0.3)
(I'm sorry I didn't adjust the scale again.)
Hmm, it looks roughly like a normal distribution; or rather, most people simply have a normal body type and few fall anywhere else, so there doesn't appear to be much bias.
[2019/11/12 postscript-] Let's try again with the histogram.
plt.hist(data_original['Body type'].astype(np.float32))
(Same apology about the scale.)
The tallest bar is 'Normal' and the second tallest is 'Muscular'. [--2019/11/12 postscript]
Next, let's get the correlation ratio.
# Correlation ratio: number of likes vs. body type
corr_ratio(data.loc[:, ["Number of likes"]].values, data.loc[:, ['Body type']].values)
result
0.9457908220700801
That's a fairly high number. Honestly, with this little data the reliability is questionable, but since 'Normal' has by far the most people, **a normal or muscular body type seems to be fine.**
**[Discussion]** **Normal or muscular.** I'll aim to keep a normal body type.
I think that covers most of the things you'd be curious about. Next, I'll try to build a likes predictor using machine learning.
I did the labeling above so I could run a regression, but for most items the encoded numbers are not proportional to how desirable the option is. For example, just because 'Other' was replaced with 4 earlier doesn't mean it is better than 'Graduate school graduate', which got the smaller value 3.
So this time I used frequency encoding. I won't explain it in detail here, but the more frequent a value is, the larger the number it gets. (I think this is fairly reasonable, given the hypothesis that there are options that the men with lots of likes all tend to choose.)
def labeling(data):
    freq_encoding = {}
    for column in data.columns:
        # Skip likes (the target variable); height and annual income will be standardized later
        if not column in ['Number of likes', 'height', 'annual income']:
            # size of each category
            freq_encoding[column] = data.groupby(column).size()
            # relative frequency of each category
            freq_encoding[column] = freq_encoding[column] / len(data)
            data[column] = data[column].map(freq_encoding[column])
            freq_encoding[column] = freq_encoding[column].to_dict()
    return freq_encoding, data
# freq_encoding is kept so the same frequency encoding can be reused on data we want to predict later
freq_encoding, data = labeling(data)
Height and annual income have been standardized.
def normalize(data):
    # Standardization
    # Height: the national average is said to be about 171.5 cm, standard deviation about 5.8
    data['height'] = (data['height'].astype(np.float32) - 171.5) / 5.8
    # No population standard deviation is available for annual income, so use the sample mean and std
    data['annual income'] = (data['annual income'] - data['annual income'].mean()) / data['annual income'].std()
    return data
data = normalize(data)
On top of that, I looked at the correlation coefficient between each feature and the number of likes. In this context, a correlation means there are options that the popular guys all tend to be choosing.
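The code for this step isn't shown above, but it could look roughly like the following (seaborn is just one way to draw the heatmap):

```python
import seaborn as sns

# Correlation matrix over the frequency-encoded / standardized features
corr = data.corr()

# How each feature correlates with the number of likes
print(corr['Number of likes'].sort_values(ascending=False))

sns.heatmap(corr, vmin=-1, vmax=1, cmap='coolwarm')
plt.show()
```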
How does it look? Check the 'Number of likes' column (the first column).
There is some correlation with:
- Background (educational background)
- WantAKids (whether they want children)
- Sociality
- Alcohol (drinking)
(The closer the value is to 1, the stronger the correlation.)
In short, if you copy what the strong men are choosing for these items, you might get more likes!!
I'm skipping over some of the process, but when I tried removing various features, the accuracy was best when the following were excluded:
- Presence or absence of a face (photo)
- Body type
- Annual income
Simply cutting the features with a low correlation ratio / correlation coefficient wasn't enough, so it seems you really do have to experiment with removing features.
It's quite surprising that body type and annual income have nothing to do with the number of likes. (But don't forget that **the annual incomes here are all 5 million yen or more**!)
[2019/11/14 postscript-] [Apology] Regarding the presence or absence of a face: the flag I actually recorded was "is a face visible in the first photo?" (I had forgotten that people register multiple photos.) So this does not mean that the face doesn't matter. [--2019/11/14 postscript]
This time I didn't use deep learning, since (classical) machine learning is what I'm studying. I tried linear regression (plain, Lasso, and Ridge), decision-tree regression, and SVR, and SVR performed best.
Also, although the holdout method isn't really recommended when the dataset is this small, I obtained an accuracy of about 83%.
Overfitting is a possibility, but I can't spend forever on this, so I'll proceed at this accuracy.
[2019/11/12 postscript-] By the way, the train/test split is roughly 8:2. [--2019/11/12 postscript]
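The training code itself isn't shown above; roughly, it would look like the sketch below (the exact setup is my guess: I'm treating the reported "accuracy" as scikit-learn's R² score, and model_svr_rbf1 is the variable name used for prediction later):

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# In the best run, 'face', 'Body type' and 'annual income' were also dropped
X = data.drop(columns=['Number of likes']).values
y = data['Number of likes'].values

# ~8:2 holdout split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model_svr_rbf1 = SVR(kernel='rbf')
model_svr_rbf1.fit(X_train, y_train)

# R^2 on the held-out 20% (around 0.83 in the article)
print(model_svr_rbf1.score(X_test, y_test))
```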
(By the way, I quit the app after less than a month, and my own number of likes ended at about 80. Sigh...)
I tried to see if I could predict my data correctly.
My data is below.
my_df = pd.DataFrame({
    'Number of likes': 80.0,
    'face': 'Yes',
    'blood type': 'O type',
    'Brothers and sisters': 'Eldest son',
    'Educational background': 'University graduate',
    'school name': 'None',
    'Occupation': 'IT related',  # 2019/11/11: I had forgotten this was set to 'company employee' as an experiment
    'annual income': '***',  # annual income doesn't matter, so it's a secret
    'height': '171',
    'Body type': 'Normal',
    'Marriage history': 'Single (unmarried)',
    'Willingness to marry': 'I want to have a good person',
    'Do you want children': 'do not know',
    'Housework / childcare': 'I want to participate actively',
    'Hope until we meet': 'I want to see you if you feel like it',
    'First date cost': 'Men pay everything',
    'Sociability': 'I like small groups',
    'Housemate': 'Living alone',
    'holiday': 'Saturday and Sunday',
    'sake': 'to drink',
    'tobacco': 'Do not smoke',
    'name_alpha': 0
}, index=[0])
# ... apply the same label encoding and frequency encoding to my_df here ...
# Drop the likes column and convert to a numpy array
X = my_df.iloc[:, 1:].values
# Predict!!!!
print(model_svr_rbf1.predict(X))
result
[73.22405579]
It's not far off at all!!!! (Machine learning is amazing.)
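The encoding step was elided above; applying the saved freq_encoding from labeling() to this one-row DataFrame could look roughly like this (just a sketch; categories that never appeared in the training data would need extra handling):

```python
# Apply the stored frequency encoding to each encoded column of my_df
for column, mapping in freq_encoding.items():
    if column in my_df.columns:
        my_df[column] = my_df[column].map(mapping)

# Height (and annual income, if disclosed) would then be standardized
# the same way as in normalize()
my_df['height'] = (my_df['height'].astype(np.float32) - 171.5) / 5.8
```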
(2019/11/11: I had forgotten that I'd set the occupation to 'company employee' as an experiment (which predicted about 64 likes). I've fixed it to 'IT related', which is what I actually entered. Does IT related make a better impression!?)
Incidentally, when I afterwards added my own record to the training data and retrained, the overall accuracy improved (from about 83% to 86%), so there's a good chance the number of likes I got was a perfectly reasonable number (tears).
...
'Educational background': 'Graduate school graduate',  # changed from 'University graduate'
...
#result
[207.56731856]
A terrifying educational background.
...
'height': '180', #Changed from 171
...
#result
[164.67592949]
Terrifying height.
So I ended up with a predictor with a rather rough accuracy of about 86%. Does that mean the number of likes really does change depending on which options you pick?
And annual income (at least within the 5-million-and-up range) and body type (well, I do have a normal body type) turned out to be unrelated. In other words, the difference may lie in factors outside the options I chose.
Based on these results, annual income doesn't seem to matter, so from now on I'll do my best to grow taller.