The Machine Learning Landscape
Why Use Machine Learning? Consider the example of writing a spam filter.
A spam filter written as a traditional program is hard to maintain because it becomes a long list of complex rules; a spam filter built with machine learning keeps the program much shorter, easier to maintain, and more accurate.
With a hand-written program, spammers can evade the filter by switching to different words once they notice which words trigger detection, so you have to keep adding new rules yourself; a machine-learning-based spam filter notices such changes and adapts automatically.
Types of Machine Learning Systems
The distinctions covered are supervised versus unsupervised learning, batch (offline) versus online learning, and instance-based versus model-based learning.
Supervised/Unsupervised Learning
Supervised learning
• In supervised learning, the training data includes labels.
• A typical supervised learning task is classification, for example spam filtering.
• Another typical task is regression: predicting a numeric value, for example the price of a car.
Methods
• K-nearest neighbors
• Linear regression
• Logistic regression
• Support vector machines (SVM)
• Decision trees and random forests
• Neural networks
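A minimal sketch of supervised classification with scikit-learn, assuming a tiny made-up dataset where each email is described by two illustrative features (the features and numbers are placeholders, not the book's data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [number of suspicious words, number of links]
X_train = np.array([[8, 5], [7, 6], [1, 0], [0, 1], [6, 4], [2, 1]])
y_train = np.array([1, 1, 0, 0, 1, 0])   # labels: 1 = spam, 0 = ham

clf = LogisticRegression()
clf.fit(X_train, y_train)                # supervised: learns from labeled examples
print(clf.predict([[5, 3]]))             # predict the class of a new email
```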
Unsupervised learning
• In unsupervised learning, the training data does not include labels.
• Clustering can be used, for example, to discover what kinds of groups a blog's visitors fall into.
Methods (clustering)
• K-means
• DBSCAN
• Hierarchical clustering
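A minimal sketch of clustering blog visitors with K-Means; no labels are used, and the features and the choice of two clusters are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical features: [visits per week, average minutes per visit]
X = np.array([[1, 2], [2, 1], [1, 1], [9, 30], [8, 25], [10, 28]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)           # cluster index assigned to each visitor
print(labels)                            # e.g. [0 0 0 1 1 1]
```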
• Visualization aims to plot the data in 2D or 3D space, and dimensionality reduction aims to simplify the data without losing too much information.
Methods (visualization and dimensionality reduction)
• Principal Component Analysis (PCA)
• Kernel PCA
• Locally Linear Embedding (LLE)
• t-distributed Stochastic Neighbor Embedding (t-SNE)
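A minimal sketch of dimensionality reduction with PCA: projecting random 3D points (illustrative data) onto the two directions that preserve the most variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))            # hypothetical 3-D dataset

pca = PCA(n_components=2)
X2d = pca.fit_transform(X)               # reduced to 2 dimensions
print(X2d.shape)                         # (100, 2)
print(pca.explained_variance_ratio_)     # how much information each axis keeps
```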
• Anomaly detection means, for example, detecting unusual credit card usage; novelty detection is a similar task in which the system is trained only on normal data and must spot new kinds of instances.
Methods (anomaly detection and novelty detection)
• One-class SVM
• Isolation Forest
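A minimal sketch of anomaly detection on made-up credit-card transactions with an Isolation Forest; a prediction of -1 marks an instance the model considers anomalous:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical features: [purchase amount, purchases that day]
X = np.array([[20, 1], [35, 2], [25, 1], [30, 2], [5000, 15]])

iso = IsolationForest(contamination=0.2, random_state=42)
print(iso.fit_predict(X))                # e.g. [ 1  1  1  1 -1]
```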
• Association rule learning aims to discover interesting relationships in large amounts of data. For example, people who buy barbecue sauce and potato chips also tend to buy steak.
Methods (association rule learning)
• Apriori
• Eclat
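A rough sketch of Apriori-based rule mining. The mlxtend library is an assumption (it is not mentioned in these notes; it is just one library that implements Apriori), and the tiny one-hot basket table is illustrative:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Each row is a (hypothetical) shopping basket, one-hot encoded per item
baskets = pd.DataFrame(
    {"bbq_sauce":    [1, 1, 1, 0],
     "potato_chips": [1, 1, 0, 1],
     "steak":        [1, 1, 0, 0]},
    dtype=bool,
)

itemsets = apriori(baskets, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "confidence"]])
```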
Semisupervised learning
• Algorithms that deal with partially labeled data (a lot of unlabeled data and a little labeled data) are called semi-supervised learning. For example, a photo service can recognize that the same face appears in several photos, and once you label that person in one photo it can name them in all the others.
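A minimal sketch of semi-supervised learning with scikit-learn's LabelPropagation: unlabeled examples are marked with -1 and the algorithm spreads the few known labels to them (toy one-feature data, illustrative only):

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]])
y = np.array([0, -1, -1, 1, -1, -1])     # only two points are labeled

model = LabelPropagation()
model.fit(X, y)
print(model.transduction_)               # inferred labels for every point
```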
Reinforcement Learning
• In reinforcement learning, an agent observes the environment, selects and performs actions, and receives rewards, learning a policy (a definition of how to behave in a given situation). Examples include a robot learning how to walk, and AlphaGo.
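A minimal, purely illustrative Q-learning sketch (nothing like AlphaGo): an agent on a 5-cell corridor learns a policy that walks right toward a reward at the far end. The learning rate, discount factor, and episode count are arbitrary choices:

```python
import numpy as np

n_states, n_actions = 5, 2               # action 0 = step left, 1 = step right
Q = np.zeros((n_states, n_actions))      # table of estimated action values
alpha, gamma, eps = 0.5, 0.9, 0.1
rng = np.random.default_rng(0)

for _ in range(200):                     # episodes
    s = 0
    while s != n_states - 1:
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the goal
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))                  # learned policy: best action per state
```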
Batch and Online Learning
Batch learning
• In batch learning, the system cannot learn incrementally: it must be trained on the full dataset, so it is also called offline learning.
Online learning
• In online learning, the system can be trained incrementally by feeding it data instances sequentially. Online learning is useful when the dataset is huge or when data keeps arriving.
• The name online learning is misleading, so it is better to think of it as incremental learning.
• The problem is that if the system is fed bad data, its performance will gradually decline, so the system needs to be monitored.
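A minimal sketch of online (incremental) learning: an SGDRegressor is updated one mini-batch at a time with partial_fit instead of being retrained on the whole dataset (the synthetic data stands in for batches arriving over time):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(42)
model = SGDRegressor()

for _ in range(100):                     # pretend each loop is new data arriving
    X_batch = rng.uniform(0, 10, size=(20, 1))
    y_batch = 3 * X_batch.ravel() + rng.normal(scale=0.5, size=20)
    model.partial_fit(X_batch, y_batch)  # learn incrementally from this batch

print(model.coef_, model.intercept_)     # should roughly recover the slope of 3
```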
Instance-Based Versus Model-Based Learning
Instance-based learning
• In instance-based learning, the system generalizes to new examples by comparing them to the learned examples using a measure of similarity. For example, spam filtering by similarity to known spam mails. (K-nearest neighbors is a typical instance-based method.)
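A minimal sketch of instance-based learning: k-nearest neighbors stores the training examples and classifies a new one by similarity to them (reusing the same hypothetical two-feature emails as in the classification sketch above):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[8, 5], [7, 6], [1, 0], [0, 1], [6, 4], [2, 1]])
y_train = np.array([1, 1, 0, 0, 1, 0])   # 1 = spam, 0 = ham

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)                # "training" mostly just stores the data
print(knn.predict([[5, 3]]))             # the nearest stored examples decide
```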
Model-based learning
• Building a model of the data and then using that model to make predictions is called model-based learning. For example, regress each country's life satisfaction against its GDP per capita, based on the hypothesis that money makes people happier.
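A minimal sketch of model-based learning: fit a linear model of life satisfaction as a function of GDP per capita. The numbers below are made-up placeholders, not the OECD/IMF figures used in the book:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

gdp_per_capita = np.array([[20_000], [30_000], [40_000], [50_000], [60_000]])
life_satisfaction = np.array([5.5, 6.0, 6.5, 7.0, 7.2])   # illustrative only

model = LinearRegression()
model.fit(gdp_per_capita, life_satisfaction)   # fit the model parameters
print(model.predict([[35_000]]))               # predict for a new country
```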
Main Challenges of Machine Learning
Important points in machine learning
• A toddler can learn to recognize an apple just by someone pointing and saying "apple", but machine learning cannot: even simple problems need thousands of examples, and complex ones can need millions.
• Even with a very large dataset, if the sampling method is flawed the data will not be representative.
• If the training data is full of errors, outliers, and noise, it is hard for the system to detect the underlying patterns, so preprocessing is required.
• Feature engineering is important.
• Overfitting is dangerous; regularization, which constrains (simplifies) the model, is used to prevent it, and its strength is adjusted with hyperparameters (see the sketch after this list).
• Underfitting is the opposite of overfitting, caused by the model being too simple.
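A minimal sketch of regularization against overfitting: Ridge adds a penalty on large weights, constraining (simplifying) the model, and the hyperparameter alpha controls how strong that constraint is (the data and alpha value are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(20, 5))      # small, noisy dataset
y = X[:, 0] + rng.normal(scale=0.3, size=20)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)      # alpha is a hyperparameter to tune

print(plain.coef_)                       # unconstrained weights
print(ridge.coef_)                       # shrunk toward zero by the penalty
```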
Testing and Validating
• Split the data, 80% for training and 20% for testing, to check whether the model generalizes.
• If you are torn between two models (e.g., linear regression or K-nearest neighbors), train both and compare them.
• How should the regularization hyperparameter be chosen? One approach is to train 100 models with 100 different hyperparameter values and compare them (see the sketch after this list).
• Holdout validation: hold out part of the training data as a validation set to evaluate candidate models and pick the best one.
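A minimal sketch of this workflow: hold out 20% of the data for testing, then use cross-validation (GridSearchCV) to compare several hyperparameter values instead of hand-training each variant (synthetic data, illustrative parameter grid):

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # synthetic labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)        # 80% train / 20% test

search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
search.fit(X_train, y_train)                     # compare candidate models
print(search.best_params_, search.score(X_test, y_test))
```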
Data Mismatch
• Even when data is easy to obtain, it may not be representative of the data that will be seen in production.
End-to-End Machine Learning Project
Published datasets
• Popular open data repositories:
  - UC Irvine Machine Learning Repository
  - Kaggle datasets
  - Amazon's AWS datasets
• Meta portals (they list open data repositories):
  - http://dataportals.org/
  - http://opendatamonitor.eu/
  - http://quandl.com/
• Other pages listing many popular open data repositories:
  - Wikipedia's list of Machine Learning datasets
  - Quora.com question
  - Datasets subreddit
Evaluation measures for regression
• Root Mean Square Error (RMSE)
• Mean Absolute Error (MAE)
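A minimal sketch of the two regression performance measures named above, computed with scikit-learn on illustrative predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # Root Mean Square Error
mae = mean_absolute_error(y_true, y_pred)            # Mean Absolute Error
print(rmse, mae)
```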
Data snooping bias
• If you look at the data too closely before choosing an approach, you will notice patterns that lead you toward a hypothesis that overfits, so it is best to keep the initial exploration moderate before deciding which algorithm to use.