Machine learning: splitting the training data, then learning, prediction, and verification

When studying or teaching machine learning with PyData.Tokyo Tutorial #1, I find that the part running from splitting the training data through learning, prediction, and verification is hard to understand. This post explains that part.

Prerequisites

- Supervised learning, in other words labeled data is available
- The dataset has a reasonable number of samples (890 in this tutorial)
- Train and verify while holding out 20% of the data as test data
- The feature matrix is multidimensional (naturally ...)
- Use sklearn (scikit-learn)
- Estimate with logistic regression
- See pydatatokyo_tutorial_ml.ipynb in PyData.Tokyo Tutorial #1 for the detailed code.

Training data split

Given a feature matrix X and class label data y, you can split the data as follows.

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, random_state=1)

[Figure: splitting the machine learning data]

- X_train: feature matrix for learning (80%)
- X_val: feature matrix for evaluation (20%); treated as unknown data
- y_train: class labels for learning (80%)
- y_val: class labels for evaluation (20%); used to check the answers for the unknown data (keep it hidden until then)
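To make the 80/20 split concrete, here is a minimal sketch that checks the sizes of the four pieces. The random X and y are hypothetical stand-ins (890 rows, 5 features), not the tutorial's actual data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 890 samples, 5 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(890, 5))
y = rng.integers(0, 2, size=890)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, random_state=1
)

# 80% of 890 rows go to training, the remaining rows to evaluation.
print(X_train.shape, X_val.shape)  # (712, 5) (178, 5)
print(y_train.shape, y_val.shape)  # (712,) (178,)
```

Fixing random_state makes the shuffle reproducible, so the same rows land in each piece on every run.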

Learning / prediction / verification

Initialization of classifier (learner)

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()

Initialize clf and use it for the following learning, prediction, and verification.

Learning

clf.fit(X_train, y_train)

Train by calling the fit method of the initialized clf. The input is the training portion (80%) of the data: the feature matrix and the class labels.
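As a rough illustration of what fit leaves behind, the sketch below trains on synthetic data and inspects the fitted classifier. The data from make_classification is an assumption standing in for the tutorial's training split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the training split: 712 rows, 5 features.
X_train, y_train = make_classification(
    n_samples=712, n_features=5, random_state=0
)

clf = LogisticRegression()
fitted = clf.fit(X_train, y_train)

print(fitted is clf)    # fit returns the same estimator object
print(clf.classes_)     # class labels seen during training
print(clf.coef_.shape)  # one learned weight per feature (binary case)
```

After fit, the learned parameters live on clf itself, which is why the same object is reused for prediction.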

Prediction

y_train_pred = clf.predict(X_train)
y_val_pred = clf.predict(X_val)

Predict with clf's predict method.

- y_train_pred: result of predicting again on the training data
- y_val_pred: result of predicting on the evaluation data

So far, y_val has not been used. That is, its labels have been kept hidden, so the evaluation data is treated as unknown data.
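The point that predict needs only a feature matrix can be sketched as follows. The synthetic data is again a hypothetical stand-in; y_val exists after the split but is never passed to the classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the tutorial's row count.
X, y = make_classification(n_samples=890, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, random_state=1
)

clf = LogisticRegression().fit(X_train, y_train)

# predict takes features only and returns one label per input row.
y_train_pred = clf.predict(X_train)
y_val_pred = clf.predict(X_val)

print(y_train_pred.shape, y_val_pred.shape)
print(y_val_pred[:10])  # predicted class labels for the first 10 rows
```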

Evaluation / verification

from sklearn.metrics import accuracy_score
train_score = accuracy_score(y_train, y_train_pred)
val_score = accuracy_score(y_val, y_val_pred)

accuracy_score is given the class label data and the predicted results from above, and returns the accuracy (the fraction of correct answers).

- train_score: accuracy of the predictions made on the training data
- val_score: accuracy of the predictions made on the evaluation data, that is, on data the classifier has not seen
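For reference, the accuracy is simply the fraction of predictions that match the true labels. The sketch below runs the whole flow on hypothetical synthetic data and recomputes val_score by hand; the name manual_val_score is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the tutorial's row count.
X, y = make_classification(n_samples=890, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, random_state=1
)

clf = LogisticRegression().fit(X_train, y_train)
y_train_pred = clf.predict(X_train)
y_val_pred = clf.predict(X_val)

train_score = accuracy_score(y_train, y_train_pred)
val_score = accuracy_score(y_val, y_val_pred)

# y_val is used only here, to check the answers for the unknown data.
manual_val_score = np.mean(y_val == y_val_pred)

print(train_score, val_score, manual_val_score)
```

A train_score that is much higher than val_score would suggest the classifier fits the training data better than it generalizes to unseen data.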
