Kaggle's initial equipment: one cell that summarizes frequently used code

Introduction

When working on Kaggle[^Kaggle], a data science competition platform, in Python[^Python] within the Notebook[^Notebook] execution environment (or Jupyter Notebook[^Jupyter Notebook] or Google Colaboratory[^Google Colaboratory]), I find that there is certain code I use again and again. To make it easy to reuse, I introduce it here as a single cell that can simply be copied and pasted. I hope that reusing this cell makes data analysis on Kaggle more efficient. (This is "Kaggle's initial equipment"!)

Conclusion

What is Kaggle's initial equipment?

This cell!

Kaggle's initial equipment-one cell that summarizes frequently used codes-.py


# Notebook display
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Handy utilities
import tqdm
import warnings
warnings.simplefilter('ignore')

# Data manipulation
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)

# Data visualization
import pandas_profiling as pdp
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

# Preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# Cross-validation
from sklearn.model_selection import StratifiedKFold

# Definition of constants
SEED = 2019
N_FOLDS = 10

# Output a list of file names given as input
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Commentary

I will explain each part of the cell above.

Notebook display

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

By default, Notebook displays only the value of the last expression in a cell, so tables and other values (for example, a pandas DataFrame) that you try to output in the middle of a cell are not shown. Setting InteractiveShell[^InteractiveShell] as described above displays every value that you try to output in the middle of a cell. However, the setting must be executed beforehand, in a cell separate from the one that produces the output. In other words, once you run this cell, the setting applies to the cells you run afterwards.
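As a rough illustration of the behavior described above, the following sketch shows a cell with two bare expressions. It assumes the settings cell has already been run in an earlier cell; the DataFrame and its column names are made up for the example.

```python
import pandas as pd

# Assumption: the settings cell above was executed in an earlier cell.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})  # made-up data

df.head(1)     # mid-cell expression: displayed only with ast_node_interactivity = "all"
df.describe()  # last expression in the cell: displayed under the default setting too
```

Without the setting, only the result of `df.describe()` would appear under the cell.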

Handy utilities

import tqdm
import warnings
warnings.simplefilter('ignore')

Some steps in data analysis take a long time. With tqdm[^tqdm], you can follow the progress of such a step on a progress bar. Because the cell imports the module itself, the progress bar is created with tqdm.tqdm(...). You may also see warnings that do not affect execution. To hide them, use warnings[^warnings], the module that controls how warnings are displayed: writing warnings.simplefilter('ignore') suppresses them.
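A minimal sketch of how these two utilities might be used together is shown below. The loop body is only a stand-in for a slow step, and the warning message is made up.

```python
import tqdm
import warnings

warnings.simplefilter('ignore')

total = 0
# `import tqdm` binds the module, so the progress-bar class is tqdm.tqdm.
for i in tqdm.tqdm(range(100)):
    total += i  # stand-in for a time-consuming step

warnings.warn("made-up warning")  # not displayed because of the 'ignore' filter
print(total)  # 4950
```

If you prefer to write `tqdm(...)` directly, `from tqdm import tqdm` is a common alternative to the import used in the cell.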

Data manipulation

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)

This is code written by default in Kaggle's Notebook. It imports two modules that are essential for data analysis: numpy[^numpy] for numerical computation and pandas[^pandas] for handling tabular data. In addition, pd.set_option('display.max_columns', None) makes pandas always display all columns.
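The effect of the display option can be checked with a wide DataFrame, as in the sketch below. The 50-column frame is made up for the example.

```python
import numpy as np
import pandas as pd

pd.set_option('display.max_columns', None)

wide = pd.DataFrame(np.zeros((2, 50)))  # made-up frame with 50 columns
print(pd.get_option('display.max_columns'))  # None, meaning "no limit"
wide  # in a notebook, all 50 columns are rendered instead of being elided with "..."
```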

Data visualization

import pandas_profiling as pdp
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

With pandas_profiling[^pandas_profiling], you can output each feature's data type and distribution, missing values, correlation coefficients, and more, all at once. It is useful for exploratory data analysis (EDA). matplotlib[^matplotlib] is a library for drawing graphs, and combining it with seaborn[^seaborn], which works as a wrapper around it, lets you draw attractive graphs. Visualizing data in this way is important for grasping the overall picture of the data.
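As one possible example of combining matplotlib and seaborn, the sketch below draws a box plot on a matplotlib axis. The data, column names, and title are made up, and `%matplotlib inline` is omitted because it is only valid inside a notebook.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()  # apply seaborn's default styling to matplotlib figures

# Made-up data: 100 random values split into two groups.
rng = np.random.default_rng(2019)
df = pd.DataFrame({
    "group": np.repeat(["a", "b"], 50),
    "value": rng.normal(size=100),
})

fig, ax = plt.subplots(figsize=(6, 4))
sns.boxplot(x="group", y="value", data=df, ax=ax)  # seaborn draws onto the matplotlib axis
ax.set_title("value by group")
```

Passing `ax=ax` keeps the figure under matplotlib's control, so titles, labels, and sizes can still be adjusted with the usual matplotlib calls.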

Preprocessing

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

For preprocessing, SimpleImputer[^SimpleImputer] can fill missing values with the mean (strategy='mean'), the median (strategy='median'), or the mode (strategy='most_frequent').
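The following sketch fills a single missing value with the median and with the mean. The input array is made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up feature column with one missing value.
X = np.array([[1.0], [np.nan], [3.0], [8.0]])

filled_median = SimpleImputer(strategy="median").fit_transform(X)
filled_mean = SimpleImputer(strategy="mean").fit_transform(X)

print(filled_median[1, 0])  # 3.0, the median of 1, 3, 8
print(filled_mean[1, 0])    # 4.0, the mean of 1, 3, 8
```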

In addition, sklearn.preprocessing[^sklearn.preprocessing] is a module that implements preprocessing. Among its classes, StandardScaler[^StandardScaler] standardizes the scale of numerical features (zero mean, unit variance), and MinMaxScaler[^MinMaxScaler] normalizes it (to the range 0 to 1 by default). For categorical variables, LabelEncoder[^LabelEncoder] converts strings to integer IDs, and OneHotEncoder[^OneHotEncoder] converts them to one-hot vectors.
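The four classes can be tried on tiny inputs, as in the sketch below. The numbers and color names are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import (
    LabelEncoder, MinMaxScaler, OneHotEncoder, StandardScaler,
)

# Made-up numerical feature.
x = np.array([[1.0], [2.0], [3.0]])
standardized = StandardScaler().fit_transform(x)  # zero mean, unit variance
normalized = MinMaxScaler().fit_transform(x)      # rescaled to [0, 1]

# Made-up categorical feature.
colors = np.array(["red", "green", "blue", "green"])
ids = LabelEncoder().fit_transform(colors)  # classes are sorted: blue=0, green=1, red=2
onehot = OneHotEncoder().fit_transform(colors.reshape(-1, 1)).toarray()

print(normalized.ravel())  # [0.  0.5 1. ]
print(ids)                 # [2 1 0 1]
print(onehot.shape)        # (4, 3)
```

OneHotEncoder returns a sparse matrix by default, which is why the sketch calls `.toarray()` before inspecting the result.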

Cross-validation

from sklearn.model_selection import StratifiedKFold 

Cross-validation is performed to guard against model overfitting, that is, to estimate generalization performance. By using StratifiedKFold[^StratifiedKFold], which implements stratified k-fold cross-validation, you can split the data into training and validation sets and verify the accuracy of the model while preserving the class distribution in each fold.

Definition of constants

SEED = 2019
N_FOLDS = 10

Fix the pseudo-random seed (SEED) in advance and set the number of cross-validation folds (N_FOLDS). Passing these to StratifiedKFold, the class that implements cross-validation, keeps the validation reproducible. Note that random_state only takes effect when shuffle=True, and recent versions of scikit-learn raise an error if random_state is set while shuffle is left at its default of False. An example is shown below.

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
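To see what the splitter does with these constants, the sketch below runs it on made-up data with an 80/20 class ratio and records the share of the positive class in each validation fold. It assumes shuffle=True, which random_state needs in order to have any effect.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

SEED = 2019
N_FOLDS = 10

# Made-up data: 100 samples, 80 of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

positive_rates = []
for train_idx, valid_idx in skf.split(X, y):
    # A model would be fit on X[train_idx] and scored on X[valid_idx] here.
    positive_rates.append(y[valid_idx].mean())

print(positive_rates)  # every fold keeps the overall 20% share of class 1
```

Running the loop again with the same SEED yields the same folds, which is what makes scores comparable between experiments.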

Output a list of file names given as input

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

This is code written by default in Kaggle's Notebook. It lets you first check the names of the files you have been given.
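If you want to keep the paths instead of only printing them, the same walk can be wrapped in a small function. In the sketch below, `list_files` is a hypothetical helper, and the temporary directory with its `titanic` folder and CSV names is made up so that the example runs outside Kaggle.

```python
import os
import tempfile


def list_files(root):
    """Return every file path under root, sorted (hypothetical helper)."""
    paths = []
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            paths.append(os.path.join(dirname, filename))
    return sorted(paths)


# Build a made-up input directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "titanic"))
    for name in ("train.csv", "test.csv"):
        open(os.path.join(root, "titanic", name), "w").close()
    found = [os.path.relpath(p, root) for p in list_files(root)]

print(found)
```

On Kaggle, the call would be `list_files('/kaggle/input')`, and the returned paths can then be passed to functions such as `pd.read_csv`.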

In closing

In this article, I collected the code frequently used in Notebook as "Kaggle's initial equipment", introduced it as a single cell, and explained each part. If you have any suggestions or advice, please do not hesitate to let me know.

(Added on December 3, 2019) This article was featured in the following articles. Thank you!
- [Python] Qiita Weekly Stock Number Ranking [Automatic Update] (updated at 13:00 on December 1, 2019)
- Python article summary (automatically updated daily) (updated at 18:00 on December 2, 2019)
