I studied Kaggle with the Kaggle Start Book [Part 1]

1. Purpose

In a previous post, I used Kaggle's Kickstarter Projects dataset to compare the accuracy of several models: [I tried to compare the accuracy of machine learning models using kaggle as a theme](https://qiita.com/Hawaii/items/4f0dd4d9cfabc4f6bb38)

This time, building on that post, **the purpose is to share the process of improving accuracy using the Kaggle Start Book released in March**. I learned a lot of new things, so I decided to split this into a first part and a second part. Today's post is the first part.

◆ Subjects to be dealt with

Mainly, I would like to focus on the parts that were new to me. Specifically, there are three topics: **"Pandas-Profiling, LightGBM, Ensemble Learning"**.

→ This time, we will work on Pandas-Profiling and LightGBM.

◆ Others

I also describe the problems I ran into with each topic, the errors, and how I solved them, so please read on if things did not go well for you.

2. Data analysis - Pandas-Profiling -

Pandas-Profiling was introduced in the Kaggle Start Book, and since I had never heard of it, I tried it.

(1) Installation

I referred to the following site. https://qiita.com/h_kobayashi1125/items/02039e57a656abe8c48f

It seems that Pandas-Profiling needs to be installed with pip or a similar tool, so I installed it.

pip install pandas-profiling

- If the installation above works without problems, skip the rest of this subsection -

Normally this should be enough, but I got an error here. I searched various sites and tried their suggestions, but I could not get past the error and it cost me a whole day. I eventually solved it, so I am writing down what I did for anyone who ends up on exactly the same path.

Here is the reference. https://gammasoft.jp/support/pip-install-error/

My case matched cause 2 on that page, so

pip install pandas-profiling --user

I ran that to install it. Then, although it was not an error, a warning appeared in red, and after I closed Jupyter Notebook and tried to start it again, it would not start ...

The error I was getting was "AttributeError: 'module' object has no attribute 's'", so I looked this up as well, and after running the following, I was able to launch it again!

pip uninstall attr
pip install attrs
pip install pandas-profiling
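
As a rough way to confirm the fix, the sketch below checks whether the importable `attr` module exposes `attr.s`. This rests on my understanding that the `attrs` package installs a module named `attr` that provides `attr.s`, while an unrelated package called `attr` installs a module of the same name without it. The function name `check_attr_module` is hypothetical and only for illustration.

```python
import importlib
import importlib.util


def check_attr_module() -> str:
    """Report whether the importable `attr` module exposes `attr.s`."""
    # find_spec returns None when no module named `attr` can be imported
    if importlib.util.find_spec("attr") is None:
        return "attr module not installed"
    attr = importlib.import_module("attr")
    # The `attrs` package provides the `attr.s` decorator
    if hasattr(attr, "s"):
        return "ok: attr.s is available"
    return "conflict: attr module has no attribute 's'"


print(check_attr_module())
```

If the last message appears, uninstalling `attr` and installing `attrs` as above is the step that worked for me.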

I don't know much about this area, so I'm sorry for the vague explanation ... I hope it helps a little.

(2) Import what you need

Some of these are not needed directly this time, but I import them all at once.

# Import numpy and pandas
import numpy as np
import pandas as pd

# Import for processing date data
import datetime

# Import for splitting into training and test data
from sklearn.model_selection import train_test_split

# Import for standardization
from sklearn.preprocessing import StandardScaler

# Import for accuracy verification
from sklearn.model_selection import cross_val_score

# Import pandas_profiling
import pandas_profiling as pdp

(3) Data reading

df = pd.read_csv(r"~\ks-projects-201801.csv")

As the Start Book notes, pandas-profiling takes a long time on large datasets, so I sample the data first.

# Sample 30% of the whole dataset
df_sample = df.sample(frac=0.3, random_state=1234)
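
To see what `frac` and `random_state` do, here is a minimal sketch on a small made-up DataFrame (`df_demo` and its columns are hypothetical stand-ins, not the Kickstarter data). It shows that fixing `random_state` makes the sample reproducible.

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV
df_demo = pd.DataFrame({"id": range(10), "goal": [100 * i for i in range(10)]})

# frac=0.3 keeps 30% of the rows; random_state fixes which rows are drawn
sample_a = df_demo.sample(frac=0.3, random_state=1234)
sample_b = df_demo.sample(frac=0.3, random_state=1234)

print(len(sample_a))              # 3 rows out of 10
print(sample_a.equals(sample_b))  # True: same seed, same rows
```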

(4) Execution of Pandas-Profiling

Normally `df_sample.profile_report()` should be enough, but for some reason, although I got no error, the result was not displayed. So I wrote the report out as an HTML file instead (I believe the file is created in the current working directory).

report = pdp.ProfileReport(df_sample)
report.to_file('profile_report.html')

It worked! This does seem to give a rough idea of what the data looks like. I would like to keep using it where appropriate.

[Screenshot 1: the generated Pandas-Profiling report]

3. Model construction - LightGBM -

Next, let's build a LightGBM model. I had heard the name, but the books I had read did not cover it much and I had never implemented it, so I'll try it here.

◆ Reference site

In addition to the Kaggle Start Book, I also referred to the following site.

https://blog.amedama.jp/entry/2018/05/01/081842#scikit-learn-%E3%82%A4%E3%83%B3%E3%82%BF%E3%83%BC%E3%83%95%E3%82%A7%E3%83%BC%E3%82%B9

(1) Import what you need

# Import numpy and pandas
import numpy as np
import pandas as pd

# Import for processing date data
import datetime

# Import for splitting into training and test data
from sklearn.model_selection import train_test_split

# Import for standardization
from sklearn.preprocessing import StandardScaler

# Import for accuracy verification
from sklearn.model_selection import cross_val_score

# Import pandas_profiling
import pandas_profiling as pdp

# Import LightGBM
import lightgbm as lgb

(2) Data reading

df = pd.read_csv(r"C:\\ks-projects-201801.csv")

(3) Preprocessing

I won't go into details, but the code comments describe what I'm doing.

# Narrow the objective variable "state" down to successful or failed only (rows in other categories, such as projects canceled midway, are dropped)
df = df[(df["state"] == "successful") | (df["state"] == "failed")]

# Then encode successful as 1 and failed as 0
df["state"] = df["state"].replace("failed",0)
df["state"] = df["state"].replace("successful",1)

# Process the date data: there is a start date (launched) and an end date (deadline), so take the difference as the fundraising period (days)
df['deadline'] = pd.to_datetime(df["deadline"])
df["launched"] = pd.to_datetime(df["launched"])
df["days"] = (df["deadline"] - df["launched"]).dt.days

# The analysis is omitted here, but drop the explanatory variables that data analysis suggested were unnecessary
df = df.drop(["ID","name","deadline","launched","backers","pledged","usd pledged","usd_pledged_real","usd_goal_real"], axis=1)

# Convert categorical variables into dummy variables
df = pd.get_dummies(df,drop_first = True)
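
To make the preprocessing steps easier to follow, here is a small sketch that applies the same ideas to three made-up rows. The `toy` DataFrame and its values are hypothetical, and I use `isin`, `copy`, and `map` as a compact alternative to the filtering and `replace` calls above, so this is an illustration rather than the exact code I ran.

```python
import pandas as pd

# Hypothetical rows that mimic a few columns of the Kickstarter data
toy = pd.DataFrame({
    "state": ["successful", "failed", "canceled"],
    "deadline": ["2015-10-09", "2017-11-01", "2016-01-01"],
    "launched": ["2015-08-11 12:12:28", "2017-09-02 04:43:57", "2015-12-01 00:00:00"],
    "country": ["GB", "US", "US"],
})

# Keep only successful/failed rows, then encode failed=0 and successful=1
toy = toy[toy["state"].isin(["successful", "failed"])].copy()
toy["state"] = toy["state"].map({"failed": 0, "successful": 1})

# Fundraising period in whole days (partial days are truncated by .dt.days)
toy["deadline"] = pd.to_datetime(toy["deadline"])
toy["launched"] = pd.to_datetime(toy["launched"])
toy["days"] = (toy["deadline"] - toy["launched"]).dt.days
toy = toy.drop(["deadline", "launched"], axis=1)

# One-hot encode the remaining string column; drop_first removes the first category (GB)
toy = pd.get_dummies(toy, drop_first=True)
print(toy)
```

After this, `toy` has the columns `state`, `days`, and `country_US`, with `days` equal to 58 and 59 for the two remaining rows.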

(4) Data division

y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

(5) Implementation of LightGBM

(i) Data set creation

lgb_train = lgb.Dataset(X_train,y_train)
lgb_eval = lgb.Dataset(X_test,y_test)

params = {"objective":"binary"}

(ii) Model construction

model = lgb.train(params,lgb_train,valid_sets=[lgb_train,lgb_eval],verbose_eval=10,num_boost_round=1000,early_stopping_rounds=10)

(iii) Accuracy verification

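# Predict on the test data and store the results in y_pred
y_pred = model.predict(X_test, num_iteration=model.best_iteration)
# Convert probabilities to labels: 1 if y_pred is greater than 0.5, otherwise 0
y_pred = (y_pred > 0.5).astype(int)

# Calculate the accuracy by comparing the labels element by element
accuracy = sum(y_test == y_pred) / len(y_test)
print(accuracy)

When I first ran this, I compared y_test against `np.argmax(y_pred)` and got **0.597469**. However, np.argmax on a one-dimensional array returns a single index rather than per-row labels, so that number does not measure the model's accuracy; the code above compares y_test with y_pred directly instead. In any case, LightGBM itself ran successfully!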

(iv) Precautions

What tripped me up this time was that I had not written `y_pred = (y_pred > 0.5).astype(int)`, so the accuracy was initially 0.

→ The book explains this clearly, but I missed it because I was writing the code while also referring to other sites.

LightGBM outputs its predictions as continuous values between 0 and 1, while y_test contains only 0 or 1, because I encoded failure as 0 and success as 1. The accuracy was 0 at first because I compared the two directly; converting values greater than 0.5 to 1 (and the rest to 0) let me compute the accuracy properly.
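
The pitfall can be reproduced with a few made-up numbers. In the sketch below, `y_true` and `y_prob` are hypothetical values, not output from my model. It also shows why `np.argmax` is not suitable for a one-dimensional vector of predictions: it returns a single index, not one label per row.

```python
import numpy as np

# Hypothetical labels and predicted probabilities
y_true = np.array([0, 1, 0, 0])
y_prob = np.array([0.2, 0.8, 0.6, 0.1])

# Comparing 0/1 labels with raw probabilities almost never matches
raw_accuracy = np.mean(y_true == y_prob)  # 0.0

# Threshold at 0.5 to turn probabilities into 0/1 labels
y_label = (y_prob > 0.5).astype(int)      # [0, 1, 1, 0]
accuracy = np.mean(y_true == y_label)     # 0.75

# argmax on a 1-D array gives the position of the first maximum, a single number
first_max_index = np.argmax(y_label)      # 1

print(raw_accuracy, accuracy, first_max_index)
```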

4. Conclusion

What did you think? I did not know pandas-profiling at all, and I think it will be useful for data analysis. I was also able to implement LightGBM, which I had long been interested in, for the first time, so I hope this post is helpful for complete beginners as well.

Next time, I will try ensemble learning.
