Start studying: Saturday, December 7th
Teaching materials, etc.:
・Miyuki Oshige, "Details! Python 3 Introductory Note" (Sotec, 2017): completed Thursday, December 19th
・Progate Python course (5 courses in total): completed Saturday, December 21st
・**Andreas C. Müller and Sarah Guido, "Introduction to Machine Learning with Python" (Japanese edition: O'Reilly Japan, 2017)**: completed Saturday, December 23rd
・Unsupervised transformations: create a representation of the data that is easier for humans or other machine learning algorithms to understand. The most common is dimensionality reduction; another example is topic extraction from a set of documents, which is useful for analyzing topics on social media.
・Clustering algorithms: divide the data into groups of similar elements. An example is the mechanism on SNS sites that groups photos by the person in them.
・Unsupervised learning is the only way to find meaning in data when no label information is available.
・Since unsupervised learning is given data that contains no label information at all, a human often has to inspect the results in order to evaluate them.
・For this reason it is often used exploratorily, to understand the data better.
・Neural networks and SVMs for supervised learning are very sensitive to the scaling of the data.
・StandardScaler: transforms each feature so that its mean is 0 and its variance is 1.
・RobustScaler: uses the median and quartiles instead of the mean and variance, so it is robust to outliers.
・MinMaxScaler: transforms the data so that every feature falls between 0 and 1.
・Normalizer: projects each data point onto a circle (sphere in higher dimensions) of radius 1. Used when only the direction (angle) of the feature vector matters, not its length.
・Transform the test set and the training set in exactly the same way: fit the scaler on the training set only.
・Learn and compute scores after preprocessing.
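A minimal sketch of this workflow, assuming scikit-learn's built-in breast cancer dataset: the scaler is fitted on the training set only, both sets are transformed identically, and the model is trained and scored afterwards.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Fit the scaler on the training set only, then apply the
# identical transformation to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Learn and score only after preprocessing
svm = SVC(C=100).fit(X_train_scaled, y_train)
print(svm.score(X_test_scaled, y_test))
```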
・Motivation: visualization, data compression, discovery of representations better suited to subsequent processing, etc.
・The algorithm most commonly used for all of the motivations above.
・Rotates the features so that they are statistically uncorrelated with each other.
・Set the variance to 1 with StandardScaler → apply PCA.
・The book explains feature extraction using the Labeled Faces in the Wild dataset.
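A minimal sketch of the StandardScaler → PCA sequence on the breast cancer dataset (keeping the first two components is an arbitrary choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
# Scale each feature to unit variance before rotating
X_scaled = StandardScaler().fit_transform(cancer.data)

pca = PCA(n_components=2)           # keep the first two components
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape)                  # (569, 2)
print(pca.components_.shape)        # each row is one principal component
```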
・Unsupervised learning aimed at extracting useful features, similar to PCA.
・NMF decomposes the data into a non-negative weighted sum of non-negative components. It is especially useful for data created by superimposing several independent sources, such as audio of multiple people speaking.
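A minimal sketch on the digits dataset, whose pixel values are non-negative as NMF requires; the component count of 10 and the nndsvd initialization are arbitrary choices:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF

digits = load_digits()  # pixel values are non-negative, as NMF requires

nmf = NMF(n_components=10, init="nndsvd", max_iter=1000, random_state=0)
W = nmf.fit_transform(digits.data)  # non-negative weights per sample
H = nmf.components_                 # non-negative components
print(W.shape, H.shape)             # (1797, 10) (10, 64)
```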
・This class of methods is called manifold learning; t-SNE is the representative example.
・It produces good visualizations and can capture complicated mappings, but it cannot transform new data: only the data used for training can be embedded. It is therefore useful for exploratory data analysis but rarely used when the ultimate goal is supervised learning.
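A minimal sketch on the digits dataset; note that TSNE only provides fit_transform, reflecting the point above that new data cannot be transformed:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()

# TSNE has no transform(): it can only embed the data it was fitted on
tsne = TSNE(random_state=42)
digits_tsne = tsne.fit_transform(digits.data)
print(digits_tsne.shape)  # (1797, 2)
```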
-Split the dataset into groups called clusters.
・The simplest and most widely used clustering algorithm.
・Finds cluster centroids that represent regions of the data and assigns each data point to the nearest centroid. Each centroid is then reset to the mean of the data points assigned to it. This is repeated, and the algorithm terminates when the assignments no longer change.
・Since the resulting groups carry no labels, all the algorithm can tell you is that, for the specified number of clusters, similar items end up together.
・Vector quantization: viewing k-means as a decomposition in which each point is represented by its nearest cluster center.
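A minimal sketch on synthetic blob data (three clusters is an arbitrary choice):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(random_state=1)  # synthetic unlabeled data

# Alternate between assigning points to the nearest centroid and
# moving each centroid to the mean of its assigned points
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)
print(kmeans.labels_[:10])      # cluster assignments
print(kmeans.cluster_centers_)  # the learned centroids
```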
・Each data point starts as its own cluster; the two most similar clusters are merged, and the process repeats until the specified number of clusters remains.
・scikit-learn implements the ward, average, and complete linkage criteria; ward is usually sufficient.
・The merge history can be visualized with a dendrogram, which can be drawn with SciPy (see the sketch below).
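A minimal sketch of drawing the dendrogram with SciPy's ward linkage on synthetic data:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.datasets import make_blobs

X, _ = make_blobs(random_state=0, n_samples=12)

# ward() returns a linkage array encoding the sequence of merges
linkage_array = ward(X)
dendrogram(linkage_array)
plt.xlabel("Sample index")
plt.ylabel("Cluster distance")
plt.show()
```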
・Abbreviation for density-based spatial clustering of applications with noise.
・Finds points in high-density regions of the feature space.
・Points that do not belong to any cluster are marked as noise (label -1), so handle the resulting cluster assignments with care.
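A minimal sketch on the two-moons dataset; eps and min_samples are written out explicitly here but are the scikit-learn defaults:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Points in low-density regions are labeled -1 (noise)
clusters = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
print(set(clusters))
```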
・Searching for the best representation of the data for a specific application is called feature engineering.
・Which features are used, and whether they are added or combined as needed, is the most important factor in determining the success of a machine learning application.
・Also called dummy variables. Replaces a categorical variable with one or more new features that take the values 0 and 1, converting it into a form that scikit-learn can handle.
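A minimal sketch with pandas.get_dummies on a small hypothetical table:

```python
import pandas as pd

# Hypothetical example data: one categorical and one numeric column
df = pd.DataFrame({"fruit": ["apple", "banana", "apple"],
                   "count": [3, 2, 5]})

# get_dummies leaves numeric columns alone and replaces each
# categorical column with 0/1 indicator columns
print(pd.get_dummies(df))
```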
・Binning (discretization) splits a continuous feature into bins, which can make linear models more powerful on continuous data.
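One way to do this, sketched with scikit-learn's KBinsDiscretizer (10 uniform bins is an arbitrary choice):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))  # a single continuous feature

# Replace the feature with 10 one-hot bin-membership indicators,
# letting a linear model fit a different value per bin
kb = KBinsDiscretizer(n_bins=10, encode="onehot-dense", strategy="uniform")
X_binned = kb.fit_transform(X)
print(X_binned.shape)  # (100, 10)
```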
・Interaction and polynomial features, which combine the original features, are effective for linear models.
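A minimal sketch with scikit-learn's PolynomialFeatures, which adds squared terms and the pairwise interaction:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)  # two original features

# degree=2 adds x0^2, x1^2 and the interaction x0*x1
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X))
print(poly.get_feature_names_out())  # ['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
```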
・Cross-validation: the data is split into k folds (typically 5-10), with each fold used in turn as the test set and the rest as the training set. Variants include shuffle-split and grouped splits.
・Grid search: tries every combination of the given parameter values and validates each one.
・When machine learning is used in practice, we are rarely interested in accurate predictions alone; the predictions usually feed into a larger decision-making process. It is necessary to compare the candidate model against another model on the chosen metric and to consider the business impact carefully.
・Threshold: the cutoff applied to a model's scores when turning them into decisions.
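A minimal sketch combining the two ideas, assuming the iris dataset and an SVC with an arbitrary parameter grid; each combination is evaluated with 5-fold cross-validation on the training set:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

param_grid = {"C": [0.01, 0.1, 1, 10, 100],
              "gamma": [0.01, 0.1, 1, 10]}

# Every parameter combination is scored with 5-fold cross-validation
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))  # final check on the held-out test set
```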
・The Pipeline class glues multiple processing steps together into a single estimator.
・Real-world machine learning applications rarely use an isolated model on its own. Gluing the steps together with a pipeline prevents mistakes such as forgetting to apply a transformation or applying the steps in the wrong order.
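A minimal sketch gluing a scaler and an SVM into one estimator with make_pipeline:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# The pipeline applies the scaler before the SVM in both fit and predict,
# so the transformation cannot be forgotten or applied in the wrong order
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```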
・Natural language processing (NLP), information retrieval (IR).
・In text analysis, the dataset is called a corpus, and each data point, represented as a single text, is called a document.
・Text data cannot be fed to machine learning algorithms as-is; it must first be converted into a numerical representation the algorithms can handle. The simplest, most efficient, and most widely used representation is bag-of-words (BoW): discard the structure of the text and count only how often each word appears.
・Three steps to compute a BoW representation:
(1) Tokenization: split each document into words, using whitespace and punctuation as a guide.
(2) Vocabulary building: collect all words that appear in any document and number them (for example, in alphabetical order).
(3) Encoding: count how often each vocabulary word appears in each document.
・The result is stored as a SciPy sparse matrix, typically in CSR (compressed sparse row) format.
・tf-idf (term frequency-inverse document frequency): instead of dropping features that look unimportant, as a stop-word list does, it rescales features by how informative they are expected to be.
・A problem with BoW: because word order is lost, some sentences with opposite meanings get exactly the same representation ("it's bad, not good at all" vs. "it's good, not bad at all"). This can be mitigated by using sequences of two or three consecutive tokens, called bigrams and trigrams respectively, instead of single tokens. The drawback is that the number of features grows dramatically. In most cases it is better to keep the minimum length at 1, since a single word often carries considerable meaning on its own.
・Stemming and lemmatization are forms of token normalization.
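A minimal sketch of the BoW word-order problem and its n-gram mitigation, using scikit-learn's CountVectorizer on the two example sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["it's bad, not good at all", "it's good, not bad at all"]

# With unigrams only, the two opposite sentences are indistinguishable;
# adding bigrams restores enough word order to separate them
for ngram_range in [(1, 1), (1, 2)]:
    vect = CountVectorizer(ngram_range=ngram_range)
    bag = vect.fit_transform(docs).toarray()
    print(ngram_range, "identical" if (bag[0] == bag[1]).all() else "different")
```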
"(Japanese title) Machine learning starting with Python" Completed