How to read your own data or external data from the Internet with scikit-learn, instead of bundled datasets such as iris

Many books and scikit-learn tutorials on the Internet use the bundled datasets such as iris and cancer. These are certainly reassuring, since you can easily reproduce the same results, but precisely because everything goes as scripted, I suspect many people find it hard to gain deep learning from them. In this article, I will show how to load your own data, or external data from the net, and analyze it with scikit-learn. (Verified environment: Windows 10, Anaconda3, Python 3.7.6, Jupyter Notebook 6.0.3) First draft published 2020/3/23

CSV file preparation

In this article, as an example, I use the World Happiness Report data published by the machine learning / data science community Kaggle (https://www.kaggle.com/unsdsn/world-happiness). Kaggle requires user registration, but I chose it because it offers datasets that are easy to use for machine learning. Download the data from the Download (79 KB) button. Unzipping the zip file yields 5 CSV files; here we will use 2019.csv.

When using other files

- 2019.csv is arranged as "feature names on the first line, data on the second and subsequent lines" so that it can easily be read by the Python data analysis library **pandas**. If your data is arranged differently, reshape it into this form, for example by deleting extra rows in Excel.

- If the file is in a different format, such as an Excel file (.xls), open it in Excel or similar, then do "File > Save As" and choose the CSV format. If a delimiter can be selected, leave it as , (comma).

- It is easiest to save the file in the same folder as your Python script or notebook (.py or .ipynb file).

Read CSV file

You can load a CSV directly without any library, but to make the rest of the process easier, this article uses pandas. (If pandas is not installed yet, please refer to [this article](https://www.sejuku.net/blog/75508) etc.)

import pandas as pd
df = pd.read_csv('2019.csv')
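As a quick sanity check (an extra step, not required), you can display the first few rows to confirm the file loaded as expected:

# Show the first 5 rows to confirm column names and values look right
print(df.head())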

If the delimiter is a tab, add the argument sep='\t', and if the file contains Japanese, add the argument encoding='shift_jis':

df = pd.read_csv('filename.csv', sep='\t', encoding='shift_jis')

If you want to keep the data file in a different location from the script, pass a relative path instead, e.g. df = pd.read_csv('data/2019.csv'). Reference → "Mutual conversion / judgment of absolute and relative paths with Python, pathlib"
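As a small sketch of the relative-path approach with pathlib (the 'data' folder name here is just an assumption):

from pathlib import Path

# Hypothetical layout: the CSV sits in a 'data' subfolder next to the notebook
csv_path = Path('data') / '2019.csv'
df = pd.read_csv(csv_path)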

Checking the feature names and the number of records

print("Confirmation of dataset key (feature amount name)==>:\n", df.keys())
print('Check the number of rows and columns in the dataframe==>\n', df.shape)

When you run the above command,

Confirmation of dataset key (feature amount name)==>:
 Index(['Overall rank', 'Country or region', 'Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')
Check the number of rows and columns in the dataframe==>
 (156, 9)

This confirms that 156 samples with 9 features were read.

Handling missing values, etc. (feature engineering)

Check whether the data contains any missing values (null), and check each column's data type: integers only (int), numbers including decimals (float), or strings / a mixture of strings and numbers (object).

# Check the number of non-missing values and the data type of each column in the dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Overall rank                  156 non-null    int64  
 1   Country or region             156 non-null    object 
 2   Score                         156 non-null    float64
 3   GDP per capita                156 non-null    float64
 4   Social support                156 non-null    float64
 5   Healthy life expectancy       156 non-null    float64
 6   Freedom to make life choices  156 non-null    float64
 7   Generosity                    156 non-null    float64
 8   Perceptions of corruption     156 non-null    float64
dtypes: float64(7), int64(1), object(1)
memory usage: 11.1+ KB

Every column has 156 non-null entries ⇒ **no missing values**. Overall rank is an integer, Country or region is a string, and the remaining columns are numbers including decimals, so every data type is as intended. If a column that should contain numbers has become an object,

# Extract elements that cannot be converted to a numeric type
# (replace 'column_name' with the name of the column to check)
objectlist = df[['column_name']][df['column_name'].apply(lambda s: pd.to_numeric(s, errors='coerce')).isnull()]
objectlist

Running the above extracts the values that are being treated as strings.

This dataset had no missing values and no mixture of strings and numbers, but for various reasons real-world data often contains values that are unsuitable for analysis as-is: blanks, characters or symbols other than "0" that mean zero, numbers with units attached, and so on.

Please refer to articles on the topic and apply appropriate preprocessing (feature engineering).
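As a minimal sketch of that kind of cleaning (the 'price' column and the replacement rules below are hypothetical, not part of this dataset):

import numpy as np

# Hypothetical cleanup of a messy numeric column 'price'
df['price'] = (df['price'].astype(str)
               .str.replace('yen', '', regex=False)   # strip a unit suffix
               .str.strip()                           # remove surrounding blanks
               .replace({'': np.nan, '-': np.nan}))   # blanks and '-' become missing
df['price'] = pd.to_numeric(df['price'], errors='coerce')  # leftovers become NaN
df['price'] = df['price'].fillna(0)  # or dropna(), depending on what the values mean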

Create an object (an empty dataset) using scikit-learn's Bunch data class

from sklearn.utils import Bunch
worldhappiness = Bunch()

Change the worldhappiness part to a name that represents your dataset.
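For background: a Bunch is a dictionary whose keys can also be read as attributes, which is why the bundled datasets such as iris can be accessed both ways. A quick illustration:

b = Bunch()
b['answer'] = 42    # set via dictionary key
print(b.answer)     # read via attribute -> 42
print(b['answer'])  # read via key       -> 42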

Put data in the dataset

# Put 'Score' (the happiness score) into the objective variable 'target'
worldhappiness['target'] = df['Score']
# Put the explanatory variables into 'data'
worldhappiness['data'] = df.loc[:, ['GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption']]

The code above specifies the 6 columns other than the first 3 (the objective variable and ID-like features not used in the analysis). It is easiest to copy and paste the column names from the output of "Checking the feature names" above.

# Storing the feature names lets you use them in graph legends, etc. (optional)
worldhappiness['feature_names'] = ['GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption']
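For example (a minimal sketch, assuming matplotlib is available), the stored names can serve as the legend of a scatter plot of each feature against the target:

import matplotlib.pyplot as plt

# Plot each explanatory variable against the happiness score,
# labeling each series with the stored feature name
for i, name in enumerate(worldhappiness['feature_names']):
    plt.scatter(worldhappiness['data'].iloc[:, i],
                worldhappiness['target'], s=10, label=name)
plt.xlabel('feature value')
plt.ylabel('Score')
plt.legend()
plt.show()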

Split into training and test sets

# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    worldhappiness['data'], worldhappiness['target'], random_state=0)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
X_train shape: (117, 6)
X_test shape: (39, 6)

The data was split into 117 training samples and 39 test samples (6 is the number of explanatory variables; by default, train_test_split reserves 25% of the data for the test set).
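With the data in this shape, any scikit-learn estimator can consume it. As a minimal sketch (linear regression is just one illustrative choice, not something this article prescribes):

from sklearn.linear_model import LinearRegression

# Fit an ordinary least-squares regression on the training set
model = LinearRegression()
model.fit(X_train, y_train)

# R^2 scores on the training and test sets
print("train score:", model.score(X_train, y_train))
print("test score:", model.score(X_test, y_test))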

Conclusion

I think you can now move on to machine learning analysis. If you find any mistakes or have questions, please feel free to comment.
