This article is based on an in-house study session with the theme "What is machine learning? How do you use it?" I hope the content is useful to others as well.
Machine learning is a field of artificial intelligence, and deep learning is a field of machine learning.
Rule-based approach: a program that covers the various possible patterns with many if statements and search logic so that appropriate output can be obtained even under complicated conditions.
Machine learning: learns patterns and features from data and, based on them, outputs predictions for unknown data.
Deep learning: a machine learning method that can automatically extract the features that characterize the data.
Reinforcement learning: in a given environment, an agent repeatedly tries actions while observing the situation and learns the optimal decision-making to achieve its goal.
Point! With a rule-based approach, a person has to rewrite the rules manually whenever an exception occurs, and it is hard to keep up as the data keeps growing. **→ With machine learning, let the computer do that work!**
Supervised learning can be broadly divided into regression and classification. Regression: the prediction result is a numerical value ("What is Japan's GDP in 2018?" → regression). Classification: the prediction result is a class ("Which species of iris is this flower?" → classification).
Decide what you want to judge and how accurate you want it to be.
Collect the data needed for prediction and judgment. You can use data already stored in a database or obtain it from the web. The collected data is divided into "training data" and "test data".
Crawling is a technique for downloading web page data from a URL. A crawling example using Python's requests:
crawling.py
import requests
# Download the Wikipedia page for Python
r = requests.get('https://ja.wikipedia.org/wiki/Python')
print(r.text)
Scraping is a technique for extracting and processing the necessary information from downloaded web pages. An example of scraping using Python's BeautifulSoup:
scraping.py
from bs4 import BeautifulSoup
# Parse the HTML downloaded by the crawling example above
soup = BeautifulSoup(r.content, 'html.parser')
# Extract the text of the first element with the class 'mw-redirect'
soup.find(class_='mw-redirect').string
>>> 'Multi-paradigm'
Data can also be obtained through the RESTful APIs published by various services. An example of acquiring data with the GitHub API and processing it with pandas:
get_github_data.py
import requests
import pandas as pd
# Search GitHub repositories via the REST API (Python repositories created on 2017-07-28, 3 per page)
git_res = requests.get('https://api.github.com/search/repositories?q=language:python+created:2017-07-28&per_page=3')
# Load the JSON response into a DataFrame and keep only the columns of interest
pd.DataFrame(git_res.json()['items'])[['language', 'stargazers_count', 'git_url', 'updated_at', 'created_at']]
Format the collected data. How to format it depends on the type of data and the subsequent modeling work.
Missing values in the features are filled in with the mean or with 0 to interpolate the data. Processing such as replacing categorical strings with flag columns (dummy variable conversion) is also performed.
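A rough sketch of these preprocessing steps with pandas; the DataFrame and its column names (age, gender) are made-up examples, not data from this article:
preprocess_sample.py
import numpy as np
import pandas as pd
# Illustrative data with a missing value and a categorical column
df = pd.DataFrame({'age': [22, np.nan, 35], 'gender': ['male', 'female', 'male']})
# Fill missing values in 'age' with the column mean (filling with 0 is also possible)
df['age'] = df['age'].fillna(df['age'].mean())
# Convert the categorical 'gender' column into dummy (flag) variables
df = pd.get_dummies(df, columns=['gender'])
print(df)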
To identify specific characters in image data, crop the images or create annotation data.
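A minimal sketch of cropping with Pillow; the file name and crop coordinates below are illustrative assumptions:
trim_sample.py
from PIL import Image
# Open an image and crop the region given as (left, upper, right, lower)
img = Image.open('sample.png')
cropped = img.crop((10, 10, 110, 110))
cropped.save('sample_cropped.png')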
Sentences are converted by morphological analysis into word-segmented text, and the words are then converted into vectors so they can be handled as numerical values (a vectorization sketch follows the word-segmentation example below).
Installation
pip install janome
Word segmentation
janome_test.py
# -*- coding: utf-8 -*-
from janome.tokenizer import Tokenizer
t = Tokenizer()
# Japanese sentence meaning "This is test data" (janome is a Japanese morphological analyzer)
document = u'これはテストデータです'
tokens = t.tokenize(document)
for token in tokens:
    print(token.surface)
output
これ
は
テスト
データ
です
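The segmented words can then be turned into numerical vectors. A minimal sketch using scikit-learn's CountVectorizer on whitespace-separated tokens; the example documents are illustrative assumptions:
vectorize_sample.py
from sklearn.feature_extraction.text import CountVectorizer
# Documents that have already been word-segmented and joined with spaces
docs = ['これ は テスト データ です', 'これ は 本番 データ です']
# Count word occurrences to obtain a vector per document
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(vectors.toarray())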
A model converts input data (the factors for a prediction or judgment) into output data (the prediction or judgment result). Roughly speaking, it is a function.
From the acquired data, organize and analyze the structure and correlations of the features that are likely to drive the prediction results, and create a model with an appropriate degree of freedom. Then determine the model's parameters from the training data to obtain a prediction model.
This part requires specialized knowledge and experience, but there are libraries that make it reasonably easy to build a model.
Example using scikit-learn: choose a model suited to the problem (linear regression this time), fit it to the training data, and a trained model is created.
liner_reg_sample.py
import numpy as np
from sklearn import linear_model
# Assume this is the collected data
x_data = np.arange(-3, 10, 0.1).reshape(-1, 1)
y_data = (1/2) * x_data + np.random.normal(0.0, 0.5, len(x_data)).reshape(-1, 1)
# Use part of it as training data
x_train = x_data[70:]
y_train = y_data[70:]
# Fit the model to the training data
reg = linear_model.LinearRegression()
reg.fit(x_train, y_train)
scikit-learn also provides train_test_split for separating training data from test data. (I did not use it this time so that each chapter is easier to read on its own.)
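A minimal sketch of how train_test_split could be applied to the x_data / y_data above; the 30% test ratio and random_state are illustrative choices:
split_sample.py
from sklearn.model_selection import train_test_split
# Split the data into 70% training data and 30% test data
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3, random_state=0)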
Some models have hyperparameters that must be set manually; they are not determined by training. (Examples: the number of layers in deep learning, the number of training iterations, etc.)
How to determine hyperparameters
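One common way is grid search with cross-validation. A minimal sketch using scikit-learn's GridSearchCV, continuing with the x_train / y_train from the sample above; the Ridge model and the alpha grid here are illustrative assumptions:
grid_search_sample.py
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
# Try several values of the regularization hyperparameter alpha with 3-fold cross-validation
params = {'alpha': [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(Ridge(), params, cv=3)
grid.fit(x_train, y_train)
print('best alpha:', grid.best_params_)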
I previously posted an article that outlines machine learning techniques; I hope it serves as a hint for model building: "Roughly organize Qiita machine learning information centered on methods".
Use the created model to make predictions on the test data.
An example using scikit-learn (continuation of the above code):
liner_reg_sample.py
# Test data
x_test = x_data[:71]
y_test = y_data[:71]
# Prediction
pred = reg.predict(x_test)
# Coefficient of determination
print('score:', reg.score(x_test, y_test))
>>> score: 0.714080213722
Verify how accurate the predictions on the test data are, using an evaluation metric suited to the model.
For classification, metrics such as accuracy, precision, recall, and F-measure can be output with scikit-learn's accuracy_score and classification_report (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).
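A minimal sketch of these classification metrics; since the running example in this article is a regression, the label arrays below are illustrative assumptions:
classification_metrics_sample.py
from sklearn.metrics import accuracy_score, classification_report
# Illustrative true labels and predicted labels
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
print('accuracy:', accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))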
For regression, metrics such as the mean absolute error and mean squared error can be output with scikit-learn's mean_absolute_error and mean_squared_error (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html).
Holdout validation: a method that splits the data into training data and test data at a fixed ratio and validates on the held-out test data.
K-fold cross-validation: a method that divides the data into K parts, using one part as test data and the rest as training data. Since there are K ways to choose the test part, all K combinations are validated and the average accuracy is used for evaluation. (Figure: example with K = 3)
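A minimal sketch of K-fold cross-validation with scikit-learn's cross_val_score, applied to the x_data / y_data from liner_reg_sample.py; K = 3 and the shuffled split are illustrative choices:
cross_validation_sample.py
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
# Evaluate linear regression with 3-fold cross-validation and average the fold scores
scores = cross_val_score(LinearRegression(), x_data, y_data, cv=KFold(n_splits=3, shuffle=True, random_state=0))
print('fold scores:', scores)
print('mean score:', scores.mean())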
An example of computing evaluation metrics with scikit-learn (continuing the liner_reg_sample.py code above):
liner_reg_sample.py
from sklearn.metrics import mean_squared_error
from math import sqrt
# Correlation coefficient
print('corr:', np.corrcoef(y_test.reshape(1, -1), pred.reshape(1, -1))[0, 1])
# RMSE
print('RMSE:', sqrt(mean_squared_error(y_test, pred)))
>>> corr: 0.895912443712
>>> RMSE: 0.6605862235679646
If it can be visualized, graph it and check it visually.
liner_reg_sample.py
import matplotlib.pyplot as plt
# Scatter the test data and overlay the predicted regression line
plt.scatter(x_test, y_test, color='blue')
plt.plot(x_test, pred, color='red')
plt.show()
Record what kind of modeling was done, what test data was used, and how accurate it was. Since I work in Python, I record this in Markdown in a Jupyter notebook.
Qiita: Various summary to use Jupyter Notebook more conveniently
If the required accuracy is not achieved in verification, sort out what went wrong and go back to "2. Collect data", "3. Format the data", or "4. Create a model and train it". Keep turning this cycle.
Overfitting: the model adapts excessively to the training data, and its prediction accuracy on unknown data becomes low.
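A minimal sketch of what overfitting can look like; the high-degree polynomial model and the generated data below are illustrative assumptions. The training score is near perfect while the test score drops sharply.
overfit_sample.py
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
# Generate a small, noisy, linear data set
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, 20).reshape(-1, 1)
y = 0.5 * x + rng.normal(0.0, 0.5, x.shape)
x_train, x_test = x[:10], x[10:]
y_train, y_test = y[:10], y[10:]
# A degree-9 polynomial has enough freedom to memorize the 10 training points
model = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
model.fit(x_train, y_train)
# A large gap between training and test scores suggests overfitting
print('train score:', model.score(x_train, y_train))
print('test score:', model.score(x_test, y_test))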
Specialized AI is AI that works in a specific field, while general-purpose AI is AI that can be applied across many different fields (like Tetsuwan A*mu). Most current AI is specialized AI.
Since the study session was only an overview, I posted this on Qiita with a little extra information added, in the hope of sharing knowledge. I would be grateful if you could point out any mistakes.