Ahead of the second term of AIQuest starting in October 2020, I participated in this beginner-only warm-up competition (https://signate.jp/competitions/293). My result in the assessment competition I entered in order to get into AIQuest was not great, though I did manage to get in. Feeling that I couldn't leave things as they were, I made time to take part in this competition, even though I only started in the latter half of September.
In this competition you cleared it simply by exceeding a certain score, so the ranking itself didn't mean much, but I still pushed to raise my ranking, both for study and to gain confidence. As a result I was lucky enough to take first place, so I would like to introduce what I did this time.
The task this time is to determine, from blood test data, age, and gender, whether or not a patient has liver disease. The evaluation metric is AUC, and the clearing condition is to exceed AUC = 0.92.
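For reference, AUC can be computed with scikit-learn's roc_auc_score; the labels and scores below are made up purely for illustration and are not the competition's scoring code.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores, made up purely for illustration
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# AUC is computed from the ranking of scores, not from hard 0/1 predictions
print(roc_auc_score(y_true, y_score))
```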
The environment is Google Colaboratory.
First, I go through the things that are commonly done when analyzing tabular data.
Check what each column means, what its data type is, and whether there are any missing values. After that, visualize the data to see how skewed each column's distribution is.
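A minimal sketch of those checks with pandas; the file name train.csv and the column names are assumptions, not the competition's actual schema.

```python
import pandas as pd

# "train.csv" and the column names below are assumptions
df = pd.read_csv("train.csv")

# Column data types and missing values
print(df.dtypes)
print(df.isnull().sum())

# Rough look at how skewed each column is
print(df.describe())
print(df["Gender"].value_counts())  # "Gender" is an assumed categorical column
df.hist(figsize=(12, 8))
```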
Next, train a model without any feature engineering and see which features are important. This step seems valuable for building a mental picture from which to form hypotheses, even if it only gives a few rough impressions. (Depending on how you look at it, though, the impressions you get here can also become a shackle.)
The candidate models were:

- Support Vector Machines
- KNN
- Logistic Regression
- Random Forest
- Naive Bayes
- Perceptron
- Stochastic Gradient Descent
- Linear SVC
- Decision Tree
- CatBoost

I adopted CatBoost.
Since there were no missing values this time, I simply scored the data with each model without any preprocessing, and the one with the best result became my first submission (first_commit). I believe the result was around 0.8.
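As a rough illustration of that baseline step, here is a minimal sketch that compares a few of the candidate models on cross-validated AUC; the DataFrame df, the target column name "disease", and the model settings are assumptions.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier

# Assumes df from the sketch above; "disease" is an assumed target column name
X = pd.get_dummies(df.drop(columns=["disease"]))  # one-hot encode any categorical columns
y = df["disease"]

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "catboost": CatBoostClassifier(verbose=0, random_state=0),
}

# Compare models on cross-validated AUC and keep the best one as the first submission
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(name, scores.mean())
```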
At around 0.8, the score above was far from the clearing condition, so I couldn't clear it without some ingenuity. In my case, I was able to meet the clearing condition with the following two ideas (if you can call them that).
At first glance this feature seems related to whether or not the patient has liver disease (its correlation was in fact high), but removing it improved the accuracy.
I can't really explain why removing it helped; honestly, I just tried deleting features mechanically and this one happened to work.
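A minimal sketch of that mechanical trial and error, assuming the X, y, and CatBoost setup from the sketch above: drop each column in turn and check whether the cross-validated AUC improves.

```python
from sklearn.model_selection import cross_val_score
from catboost import CatBoostClassifier

# Assumes X and y from the sketch above
baseline = cross_val_score(
    CatBoostClassifier(verbose=0, random_state=0), X, y, cv=5, scoring="roc_auc"
).mean()

# Drop one column at a time and check whether the cross-validated AUC improves
for col in X.columns:
    score = cross_val_score(
        CatBoostClassifier(verbose=0, random_state=0),
        X.drop(columns=[col]), y, cv=5, scoring="roc_auc",
    ).mean()
    print(col, round(score - baseline, 4))
```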
By this point I had also tried:

- Target encoding
- Binning age into teens, 20s, and so on (sketched below)

but neither produced any improvement.
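For reference, a minimal sketch of those two attempts, assuming an "Age" column and the df/target names used above; the bin edges and the naive mean-based target encoding are illustrative, not the exact setup I used.

```python
import pandas as pd

# Assumes df with an "Age" column and a "disease" target, as in the sketches above

# Bin age into teens, 20s, 30s, ... (bin edges are an assumption)
df["age_bin"] = pd.cut(df["Age"], bins=range(0, 101, 10), right=False, labels=False)

# Naive target encoding: replace each age bin with the mean target value of that bin
# (a proper version would use out-of-fold means to avoid leakage)
df["age_bin_te"] = df["age_bin"].map(df.groupby("age_bin")["disease"].mean())
```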
The accuracy increases to about 0.83 here.
assessment.py

```python
# Output hard 0/1 labels
model.predict(pred)

# Output the probability of the positive class; since AUC evaluates the ranking
# of scores, this is the output to submit
model.predict_proba(pred)[:, 1]
```
Switching from the 0/1 output to the probability output took the score from 0.83 to 0.92, reaching the pass line.
Clearing the competition had nothing to do with the ranking, but I was motivated, so I kept trying to improve the accuracy. The following are the things that raised my score.
I added features indicating whether each blood value is within the medically normal range. You could of course judge from the raw values, but since every measurement has a different unit it is hard to judge them comprehensively, and I wanted features that express only "is it within the normal range or not", so I adopted this. This method turned out to be quite effective; just doing this lifted my ranking into the top 10.
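A minimal sketch of those flag features; the column names and reference ranges below are hypothetical placeholders, and real normal ranges should be taken from a medical reference.

```python
# Hypothetical column names and reference ranges, for illustration only
normal_ranges = {
    "ALT": (0, 45),
    "AST": (0, 40),
    "T_Bil": (0.2, 1.2),
}

# 1 if the value is inside the assumed normal range, else 0
for col, (low, high) in normal_ranges.items():
    df[f"{col}_normal"] = df[col].between(low, high).astype(int)

# Count how many measurements fall outside their normal range for each row
df["n_abnormal"] = len(normal_ranges) - df[[f"{c}_normal" for c in normal_ranges]].sum(axis=1)
```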
Remove irregular rows from the training data using the knowledge gained above.
"Data that is judged to be normal even though most of the numerical values are outliers" that could not be found by simply deleting the data that has prominent data mechanically was removed. Inference of 1 or 0 may not have much effect, but since it is a probability notation, by erasing the data that is the exact opposite of the tendency, it will be possible to express outstanding white and black data with 1 or 0 as much as possible. Then? This was done from the assumption. This was the decisive hit that was able to climb to 1st place.
It is a debatable technique, but since this is a beginner-only competition where the ranking doesn't really matter, I'm glad I gave it a try.
I feel that this kind of rough-and-ready analysis would not hold up at a practical level, so I want to keep working to do better.