This article is based on an in-house study session with the theme "What is machine learning? How do you use it?" I hope the content is useful to others as well.
Machine learning is a field of artificial intelligence, and deep learning is a field of machine learning.
Rule-based approach: a program that covers the various possible patterns with many if statements and search logic so that appropriate output can be obtained even under complicated conditions.
Machine learning: learns patterns and features from data and, based on them, outputs predictions for unknown data.
Deep learning: a machine learning method that can automatically extract the features that characterize the data.
Reinforcement learning: in a given environment, an agent repeatedly tries actions while observing the situation and learns the optimal decision-making to achieve its goal.
Point! With a rule-based approach, a person has to rewrite the rules manually whenever an exception occurs, and it is hard to keep up as the data keeps growing. **→ With machine learning, let the computer do that work!**
Supervised learning can be broadly divided into regression and classification. Regression: the prediction result is a numerical value ("What is Japan's GDP in 2018?" → regression). Classification: the prediction result is a class ("Which species of iris is this flower?" → classification).
Decide what you want to judge and how accurate you want it to be.
Collect the data needed for prediction and judgment. You can use data already stored in a database or obtain it from the web. The collected data is divided into "training data" and "test data".
Crawling is a technique for downloading web page data from a URL. A crawling example using Python's requests:
crawling.py
import requests
# Download the Wikipedia page for Python
r = requests.get('https://ja.wikipedia.org/wiki/Python')
print(r.text)
Scraping is a technique for extracting and processing the necessary information from downloaded web pages. An example of scraping using Python's BeautifulSoup:
scraping.py
from bs4 import BeautifulSoup
# Parse the HTML downloaded by the crawling example above
soup = BeautifulSoup(r.content, 'html.parser')
# Extract the text of the first element with the class 'mw-redirect'
soup.find(class_='mw-redirect').string
>>> 'Multi-paradigm'
Data can also be obtained through the RESTful APIs published by various services. An example of acquiring data with the GitHub API and processing it with pandas:
get_github_data.py
import requests
import pandas as pd
# Search GitHub repositories via the REST API (Python repositories created on 2017-07-28, 3 per page)
git_res = requests.get('https://api.github.com/search/repositories?q=language:python+created:2017-07-28&per_page=3')
# Load the JSON response into a DataFrame and keep only the columns of interest
pd.DataFrame(git_res.json()['items'])[['language', 'stargazers_count', 'git_url', 'updated_at', 'created_at']]
Format the collected data. How to format it depends on the type of data and the subsequent modeling work.
Missing values in the features are filled in with the mean or with 0 to interpolate the data. Processing such as replacing categorical strings with flag columns (dummy variable conversion) is also performed.
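A rough sketch of these preprocessing steps with pandas; the DataFrame and its column names (age, gender) are made-up examples, not data from this article:
preprocess_sample.py
import numpy as np
import pandas as pd
# Illustrative data with a missing value and a categorical column
df = pd.DataFrame({'age': [22, np.nan, 35], 'gender': ['male', 'female', 'male']})
# Fill missing values in 'age' with the column mean (filling with 0 is also possible)
df['age'] = df['age'].fillna(df['age'].mean())
# Convert the categorical 'gender' column into dummy (flag) variables
df = pd.get_dummies(df, columns=['gender'])
print(df)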
To identify specific characters in image data, crop the images or create annotation data.
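A minimal sketch of cropping with Pillow; the file name and crop coordinates below are illustrative assumptions:
trim_sample.py
from PIL import Image
# Open an image and crop the region given as (left, upper, right, lower)
img = Image.open('sample.png')
cropped = img.crop((10, 10, 110, 110))
cropped.save('sample_cropped.png')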
Sentences are converted by morphological analysis into word-segmented text, and the words are then converted into vectors so they can be handled as numerical values (a vectorization sketch follows the word-segmentation example below).
Installation
pip install janome
Word segmentation
janome_test.py
# -*- coding: utf-8 -*-
from janome.tokenizer import Tokenizer
t = Tokenizer()
# Japanese sentence meaning "This is test data" (janome is a Japanese morphological analyzer)
document = u'これはテストデータです'
tokens = t.tokenize(document)
for token in tokens:
    print(token.surface)
output
これ
は
テスト
データ
です
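The segmented words can then be turned into numerical vectors. A minimal sketch using scikit-learn's CountVectorizer on whitespace-separated tokens; the example documents are illustrative assumptions:
vectorize_sample.py
from sklearn.feature_extraction.text import CountVectorizer
# Documents that have already been word-segmented and joined with spaces
docs = ['これ は テスト データ です', 'これ は 本番 データ です']
# Count word occurrences to obtain a vector per document
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(vectors.toarray())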
A model converts input data (the factors for a prediction or judgment) into output data (the prediction or judgment result). Roughly speaking, it is a function.
From the acquired data, organize and analyze the structure and correlations of the features that are likely to drive the prediction results, and create a model with an appropriate degree of freedom. Then determine the model's parameters from the training data to obtain a prediction model.
This part requires specialized knowledge and experience, but there are libraries that make it reasonably easy to build a model.
Example using scikit-learn: choose a model suited to the problem (linear regression this time), fit it to the training data, and a trained model is created.
liner_reg_sample.py
import numpy as np
from sklearn import linear_model
# Assume this is the collected data
x_data = np.arange(-3, 10, 0.1).reshape(-1, 1)
y_data = (1/2) * x_data + np.random.normal(0.0, 0.5, len(x_data)).reshape(-1, 1)
# Use part of it as training data
x_train = x_data[70:]
y_train = y_data[70:]
# Fit the model to the training data
reg = linear_model.LinearRegression()
reg.fit(x_train, y_train)
scikit-learn also provides train_test_split for separating training data from test data. (I did not use it this time so that each chapter is easier to read on its own.)
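A minimal sketch of how train_test_split could be applied to the x_data / y_data above; the 30% test ratio and random_state are illustrative choices:
split_sample.py
from sklearn.model_selection import train_test_split
# Split the data into 70% training data and 30% test data
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3, random_state=0)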
Some models have hyperparameters that must be set manually; they are not determined by training. (Examples: the number of layers in deep learning, the number of training iterations, etc.)
How to determine hyperparameters
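One common way is grid search with cross-validation. A minimal sketch using scikit-learn's GridSearchCV, continuing with the x_train / y_train from the sample above; the Ridge model and the alpha grid here are illustrative assumptions:
grid_search_sample.py
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
# Try several values of the regularization hyperparameter alpha with 3-fold cross-validation
params = {'alpha': [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(Ridge(), params, cv=3)
grid.fit(x_train, y_train)
print('best alpha:', grid.best_params_)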
I previously posted an article that outlines machine learning techniques; I hope it serves as a hint for model building: "Roughly organize Qiita machine learning information centered on methods".
Use the created model to make predictions on the test data.
An example using scikit-learn (continuation of the above code):
liner_reg_sample.py
# Test data
x_test = x_data[:71]
y_test = y_data[:71]
# Prediction
pred = reg.predict(x_test)
# Coefficient of determination
print('score:', reg.score(x_test, y_test))
>>> score: 0.714080213722
Verify how accurate the predictions on the test data are, using an evaluation metric suited to the model.
For classification, metrics such as accuracy, precision, recall, and F-measure can be output with scikit-learn's accuracy_score and classification_report (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).
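A minimal sketch of these classification metrics; since the running example in this article is a regression, the label arrays below are illustrative assumptions:
classification_metrics_sample.py
from sklearn.metrics import accuracy_score, classification_report
# Illustrative true labels and predicted labels
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
print('accuracy:', accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))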
For regression, metrics such as the mean absolute error and mean squared error can be output with scikit-learn's mean_absolute_error and mean_squared_error (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html).
Holdout validation: a method that splits the data into training data and test data at a fixed ratio and validates on the held-out test data.
K-fold cross-validation: a method that divides the data into K parts, using one part as test data and the rest as training data. Since there are K ways to choose the test part, all K combinations are validated and the average accuracy is used for evaluation. (Figure: example with K = 3)
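A minimal sketch of K-fold cross-validation with scikit-learn's cross_val_score, applied to the x_data / y_data from liner_reg_sample.py; K = 3 and the shuffled split are illustrative choices:
cross_validation_sample.py
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
# Evaluate linear regression with 3-fold cross-validation and average the fold scores
scores = cross_val_score(LinearRegression(), x_data, y_data, cv=KFold(n_splits=3, shuffle=True, random_state=0))
print('fold scores:', scores)
print('mean score:', scores.mean())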
An example of computing evaluation metrics with scikit-learn (continuing the liner_reg_sample.py code above):
liner_reg_sample.py
from sklearn.metrics import mean_squared_error
from math import sqrt
# Correlation coefficient
print('corr:', np.corrcoef(y_test.reshape(1, -1), pred.reshape(1, -1))[0, 1])
# RMSE
print('RMSE:', sqrt(mean_squared_error(y_test, pred)))
>>> corr: 0.895912443712
>>> RMSE: 0.6605862235679646
If it can be visualized, graph it and check it visually.
liner_reg_sample.py
import matplotlib.pyplot as plt
# Scatter the test data and overlay the predicted regression line
plt.scatter(x_test, y_test, color='blue')
plt.plot(x_test, pred, color='red')
plt.show()
Record what kind of modeling was done, what test data was used, and how accurate it was. Since I work in Python, I record this in Markdown in a Jupyter notebook.
Qiita: Various summary to use Jupyter Notebook more conveniently
If the required accuracy is not achieved in verification, sort out what went wrong and go back to "2. Collect data", "3. Format the data", or "4. Create a model and train it". Keep turning this cycle.
Overfitting: the model adapts excessively to the training data, and its prediction accuracy on unknown data becomes low.
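A minimal sketch of what overfitting can look like; the high-degree polynomial model and the generated data below are illustrative assumptions. The training score is near perfect while the test score drops sharply.
overfit_sample.py
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
# Generate a small, noisy, linear data set
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, 20).reshape(-1, 1)
y = 0.5 * x + rng.normal(0.0, 0.5, x.shape)
x_train, x_test = x[:10], x[10:]
y_train, y_test = y[:10], y[10:]
# A degree-9 polynomial has enough freedom to memorize the 10 training points
model = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
model.fit(x_train, y_train)
# A large gap between training and test scores suggests overfitting
print('train score:', model.score(x_train, y_train))
print('test score:', model.score(x_test, y_test))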
Specialized AI is AI that works in a specific field, while general-purpose AI is AI that can be applied across many different fields (like Tetsuwan A*mu). Most current AI is specialized AI.
Since the study session was only an overview, I posted this on Qiita with a little extra information added, in the hope of sharing knowledge. I would be grateful if you could point out any mistakes.