[Translation] scikit-learn 0.18 Tutorial: Working with Text Data

A translation of http://scikit-learn.org/0.18/tutorial/text_analytics/working_with_text_data.html (originally machine-translated with Google Translate).


Working with text data

The purpose of this guide is to explore some of the main scikit-learn tools on a single practical task: analyzing a collection of text documents (newsgroup posts) on twenty different topics. In this section we will see how to:

- load the file contents and the categories
- extract feature vectors suitable for machine learning
- train a linear model to perform categorization
- use a grid search strategy to find a good configuration of both the feature extraction components and the classifier

Tutorial setup

To start this tutorial, you first need to install scikit-learn and all the required dependencies. See the Installation Instructions page (http://scikit-learn.org/0.18/install.html#installation-instructions) for more information and system-specific instructions. The source for this tutorial is in the scikit-learn folder:

scikit-learn/doc/tutorial/text_analytics/

The tutorial folder should contain the following folders:

- *.rst files - the source of the tutorial document, written with sphinx
- data - folder to put the datasets used during the tutorial
- skeletons - sample incomplete scripts for the exercises
- solutions - solutions of the exercises

Copy the skeletons to a new folder named sklearn_tut_workspace somewhere on your hard drive. The original skeletons stay intact, and for the exercises you edit the copied files:

% cp -r skeletons work_directory/sklearn_tut_workspace

Machine learning needs data. Go to each `$TUTORIAL_HOME/data` subfolder and run the `fetch_data.py` script from there (after having read it first). For example:

% cd $TUTORIAL_HOME/data/languages
% less fetch_data.py
% python fetch_data.py

Loading the 20 newsgroups dataset

The dataset is called "20 newsgroups". Here is the official description, quoted from the website:

The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang for his paper "Newsweeder: Learning to Filter Netnews", though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular dataset for experiments in text applications of machine learning techniques, such as text classification and text clustering.

In the following we will use the dataset loader for 20 newsgroups built into scikit-learn. Alternatively, it is possible to download the dataset manually from the website and use the [sklearn.datasets.load_files](http://scikit-learn.org/0.18/modules/generated/sklearn.datasets.load_files.html#sklearn.datasets.load_files) function, pointing it to the 20news-bydate-train subfolder of the uncompressed archive folder (a short sketch of this manual alternative appears right after the category list below). In order to get faster execution times for this first example, we will work on a partial dataset using only 4 of the 20 categories available in the dataset:

>>> categories = ['alt.atheism', 'soc.religion.christian',
...               'comp.graphics', 'sci.med']
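
As mentioned above, you could instead download and unpack the archive yourself and read it from disk with load_files. The following is only a sketch under that assumption; the local path is hypothetical and the call is not needed if you use the built-in loader:

>>> # Hypothetical sketch: load a manually downloaded copy of the dataset.
>>> # '20news-bydate-train' is an assumed local folder; encoding='latin-1' is
>>> # an assumption so that the raw files are decoded to text.
>>> from sklearn.datasets import load_files
>>> twenty_train_manual = load_files('20news-bydate-train',
...     categories=categories, encoding='latin-1', shuffle=True,
...     random_state=42)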

You can load a list of files that match these categories as follows:

>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty_train = fetch_20newsgroups(subset='train',
...     categories=categories, shuffle=True, random_state=42)

The returned dataset is a scikit-learn "bunch": a simple holder object whose fields can be accessed both as Python `dict` keys and as `object` attributes for convenience. For example, `target_names` holds the list of the requested category names:

>>> twenty_train.target_names
['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

The files themselves are loaded in memory in the `data` attribute. The filenames are also available for reference:

>>> len(twenty_train.data)
2257
>>> len(twenty_train.filenames)
2257

Let's print the first lines of the first loaded file:

>>> print("\n".join(twenty_train.data[0].split("\n")[:3]))
From: [email protected] (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton

>>> print(twenty_train.target_names[twenty_train.target[0]])
comp.graphics

Supervised learning algorithms require a category label for each document in the training set. In this case the category is the name of the newsgroup, which also happens to be the name of the folder holding the individual documents. For speed and space efficiency, scikit-learn loads the target attribute as an array of integers that correspond to the indices of the category names in the `target_names` list. The category integer id of each sample is stored in the `target` attribute:

>>> twenty_train.target[:10]
array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])

You can get the category name as follows:

>>> for t in twenty_train.target[:10]:
...     print(twenty_train.target_names[t])
...
comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med

You will notice that the samples were shuffled randomly (with a fixed random seed). This is useful if you want to quickly train a model using only the first samples and get a first idea of the results before re-training on the complete dataset.
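
As a small sketch of that remark (the slice size of 500 is an arbitrary choice for illustration), a quick experiment could keep only the first shuffled samples:

>>> # Sketch: keep only the first 500 shuffled samples for a quick experiment.
>>> small_data = twenty_train.data[:500]
>>> small_target = twenty_train.target[:500]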

Extracting features from text files

In order to perform machine learning on a text document, we first need to convert the text content into a numeric feature vector.

Bag of Words

The most intuitive way to do this is with the Bag of Words representation:

  1. Assign a fixed integer id to each word present in any document in the training set (for example, by building a dictionary from word to integer index).
  2. For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j, where j is the index of word w in the dictionary.

The bags of words representation implies that n_features is the number of distinct words in the corpus: this number typically exceeds 100,000. For n_samples == 10000, storing X as a numpy array of floats would require 10000 x 100000 x 4 bytes = **4GB of RAM**, which is barely manageable on today's computers. Fortunately, **most values in X will be zeros**, since a given document uses only a few thousand distinct words at most. For this reason we say that bags of words are typically **high-dimensional sparse datasets**. A lot of memory can be saved by storing only the non-zero parts of the feature vectors in memory. scipy.sparse matrices are data structures that do exactly this, and scikit-learn has built-in support for them.
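
To make steps 1 and 2 and the sparse storage concrete, here is a minimal toy sketch (the two-document corpus is invented for illustration and is not part of the tutorial's data): it builds the word-to-index dictionary by hand and stores the counts in a scipy.sparse matrix, which keeps only the non-zero entries in memory.

>>> from scipy.sparse import lil_matrix
>>> docs = ['the cat sat', 'the dog sat on the mat']
>>> vocabulary = {}
>>> for doc in docs:
...     for word in doc.split():
...         if word not in vocabulary:        # step 1: assign each new word an integer id
...             vocabulary[word] = len(vocabulary)
...
>>> X = lil_matrix((len(docs), len(vocabulary)))
>>> for i, doc in enumerate(docs):
...     for word in doc.split():
...         X[i, vocabulary[word]] += 1       # step 2: count occurrences of feature j in document i
...
>>> X.toarray()
array([[ 1.,  1.,  1.,  0.,  0.,  0.],
       [ 2.,  0.,  1.,  1.,  1.,  1.]])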

Tokenize text with scikit-learn

Text preprocessing, tokenizing and filtering of stopwords are included in a high-level component that is able to build a dictionary of features and transform documents into feature vectors:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count_vect = CountVectorizer()
>>> X_train_counts = count_vect.fit_transform(twenty_train.data)
>>> X_train_counts.shape
(2257, 35788)

`CountVectorizer` supports counts of N-grams of words or consecutive characters. Once fitted, the vectorizer has built a dictionary of feature indices:

>>> count_vect.vocabulary_.get(u'algorithm')
4690
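
The N-gram support mentioned above is controlled by parameters of the same class. Here is a small sketch (the sentence is an invented example and the parameter values are arbitrary) of counting word bigrams in addition to single words:

>>> # Sketch: the same class can also count word bigrams (or character
>>> # n-grams via analyzer='char_wb'); the fitted vocabulary then contains
>>> # entries such as 'quick brown'.
>>> bigram_vect = CountVectorizer(ngram_range=(1, 2))
>>> _ = bigram_vect.fit(['the quick brown fox'])
>>> 'quick brown' in bigram_vect.vocabulary_
True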

The index value of a word in the vocabulary is linked to its frequency in the whole training corpus.
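
As a quick sketch of how to inspect that frequency with the objects fitted above (this check is not part of the original tutorial), you can sum the column of the count matrix corresponding to a word's index:

>>> # Sketch: total number of occurrences of 'algorithm' in the training corpus.
>>> algorithm_total = X_train_counts[:, count_vect.vocabulary_.get(u'algorithm')].sum()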

From occurrences to frequencies

Occurrence counts are a good start, but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics. To avoid these potential discrepancies, it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf, for Term Frequencies. Another refinement on top of tf is to downscale the weights of words that occur in many documents of the corpus, and are therefore less informative than those that occur only in a smaller portion of the corpus. This downscaling is called tf-idf, for "Term Frequency times Inverse Document Frequency". Both **tf** and **tf-idf** can be computed as follows:

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
>>> X_train_tf = tf_transformer.transform(X_train_counts)
>>> X_train_tf.shape
(2257, 35788)

In the above example code, we first use the `fit(..)` method to fit the estimator to the data, and then the `transform(..)` method to transform the count matrix to a tf-idf representation. These two steps can be combined to achieve the same end result faster by skipping redundant processing. This is done using the `fit_transform(..)` method, as shown below:

>>> tfidf_transformer = TfidfTransformer()
>>> X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
>>> X_train_tfidf.shape
(2257, 35788)
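
As a side note, here is a small conceptual sketch of the "divide by the total number of words" idea described earlier, on an invented 2 x 3 count matrix. This plain division corresponds to TfidfTransformer's norm='l1' option; the transformer's default is an l2 normalization:

>>> # Conceptual sketch of plain term frequencies on a toy count matrix.
>>> import numpy as np
>>> counts = np.array([[3., 0., 1.],
...                    [2., 0., 0.]])
>>> counts / counts.sum(axis=1, keepdims=True)
array([[ 0.75,  0.  ,  0.25],
       [ 1.  ,  0.  ,  0.  ]])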

Training a classifier

Now that we have our features, we can train a classifier to predict the category of a post. Let's start with a naive Bayes classifier, which provides a nice baseline for this task. scikit-learn includes several variants of this classifier; the one most suitable for word counts is the multinomial variant:

>>> from sklearn.naive_bayes import MultinomialNB
>>> clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

To predict the outcome on a new document, we need to extract the features using almost the same feature-extraction chain as before. The difference is that we call `transform` instead of `fit_transform` on the transformers, since they have already been fitted to the training set:

>>> docs_new = ['God is love', 'OpenGL on the GPU is fast']
>>> X_new_counts = count_vect.transform(docs_new)
>>> X_new_tfidf = tfidf_transformer.transform(X_new_counts)

>>> predicted = clf.predict(X_new_tfidf)

>>> for doc, category in zip(docs_new, predicted):
...     print('%r => %s' % (doc, twenty_train.target_names[category]))
...
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics

Building a pipeline

In order to make the vectorizer => transformer => classifier chain easier to work with, scikit-learn provides a `Pipeline` class that behaves like a compound classifier:

>>> from sklearn.pipeline import Pipeline
>>> text_clf = Pipeline([('vect', CountVectorizer()),
...                      ('tfidf', TfidfTransformer()),
...                      ('clf', MultinomialNB()),
... ])

The names `vect`, `tfidf` and `clf` (classifier) are arbitrary. We will see how to use them in the Grid Search section below. You can now train the model with a single command:

>>> text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

Evaluating performance on the test set

Evaluating the predictive accuracy of the model is equally easy:

>>> twenty_test = fetch_20newsgroups(subset='test',
...     categories=categories, shuffle=True, random_state=42)
>>> docs_test = twenty_test.data
>>> predicted = text_clf.predict(docs_test)
>>> import numpy as np
>>> np.mean(predicted == twenty_test.target)
0.834...

We achieved an accuracy of 83.4%. Let's see whether we can do better with a linear support vector machine (SVM), which is widely regarded as one of the best text classification algorithms (although it is also a bit slower than naive Bayes). We can change the learner by simply plugging a different classifier object into our pipeline:

>>> from sklearn.linear_model import SGDClassifier
>>> text_clf = Pipeline([('vect', CountVectorizer()),
...                      ('tfidf', TfidfTransformer()),
...                      ('clf', SGDClassifier(loss='hinge', penalty='l2',
...                                            alpha=1e-3, n_iter=5, random_state=42)),
... ])
>>> _ = text_clf.fit(twenty_train.data, twenty_train.target)
>>> predicted = text_clf.predict(docs_test)
>>> np.mean(predicted == twenty_test.target)            
0.912...

scikit-learn also provides a utility for more detailed performance analysis of results:

>>> from sklearn import metrics
>>> print(metrics.classification_report(twenty_test.target, predicted,
...     target_names=twenty_test.target_names))
...                                         
                        precision    recall  f1-score   support

           alt.atheism       0.95      0.81      0.87       319
         comp.graphics       0.88      0.97      0.92       389
               sci.med       0.94      0.90      0.92       396
soc.religion.christian       0.90      0.95      0.93       398

           avg / total       0.92      0.91      0.91      1502


>>> metrics.confusion_matrix(twenty_test.target, predicted)
array([[258,  11,  15,  35],
       [  4, 379,   3,   3],
       [  5,  33, 355,   3],
       [  5,  10,   4, 379]])

As expected, the confusion matrix shows that posts from the newsgroups on atheism and Christianity are more often confused with one another than with computer graphics.
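
If the raw array is hard to read, one optional sketch (this assumes pandas is installed; it is not required by the tutorial) is to wrap the matrix in a DataFrame labeled with the category names:

>>> # Optional sketch: label the confusion matrix with the category names.
>>> import pandas as pd
>>> cm = metrics.confusion_matrix(twenty_test.target, predicted)
>>> cm_df = pd.DataFrame(cm, index=twenty_test.target_names,
...                      columns=twenty_test.target_names)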

Parameter tuning by grid search

We have already encountered some parameters such as `use_idf` in the `TfidfTransformer`. Classifiers tend to have many parameters as well: `MultinomialNB` includes a smoothing parameter `alpha`, and `SGDClassifier` has a penalty parameter `alpha` and configurable loss and penalty terms in the objective function (see the module documentation, or use the Python `help` function to get a description of these). Instead of tweaking the parameters of the various components of the chain, it is possible to run an exhaustive search of the best parameters on a grid of possible values. We try out all classifiers on either words or bigrams, with or without idf, and with a penalty parameter (alpha) of either 0.01 or 0.001 for the linear SVM:

>>> from sklearn.model_selection import GridSearchCV
>>> parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
...               'tfidf__use_idf': (True, False),
...               'clf__alpha': (1e-2, 1e-3),
... }

Obviously, such an exhaustive search can be expensive. If you have multiple CPU cores at your disposal, you can pass the n_jobs parameter to the grid searcher to tell it to try these eight parameter combinations in parallel. If you give this parameter a value of -1, grid search will detect how many cores are installed and use them all:

>>> gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

The grid search instance behaves like a normal scikit-learn model. Let's perform the search on a smaller subset of the training data to speed up the computation:

>>> gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])

The result of calling `fit` on a `GridSearchCV` object is a classifier that we can use to `predict`:

>>> twenty_train.target_names[gs_clf.predict(['God is love'])[0]]
'soc.religion.christian'

The object's `best_score_` and `best_params_` attributes store the best mean score and the parameter settings corresponding to that score:

>>> gs_clf.best_score_                                  
0.900...
>>> for param_name in sorted(parameters.keys()):
...     print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
...
clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)

A more detailed summary of the search is available in `gs_clf.cv_results_`. The `cv_results_` attribute can easily be imported into pandas as a `DataFrame` for further inspection.
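
For example, a short sketch of that last point (assuming pandas is installed; the selected columns are keys that cv_results_ exposes in scikit-learn 0.18):

>>> # Sketch: inspect the grid search results with pandas.
>>> import pandas as pd
>>> results = pd.DataFrame(gs_clf.cv_results_)
>>> summary = results[['params', 'mean_test_score', 'rank_test_score']]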

Exercise

To do the exercises, copy the contents of the skeletons folder into a new folder named workspace:

% cp -r skeletons workspace

You can then edit the contents of the workspace without fear of losing the original exercise instructions. Then fire up an ipython shell and run the work-in-progress script with:

[1] %run workspace/exercise_XX_script.py arg1 arg2 arg3

If an exception is triggered, use `%debug` to fire up a post-mortem debugging session. Refine the implementation and iterate until the exercise is solved. For each exercise, the skeleton file provides all the necessary import statements, boilerplate code to load the data, and sample code to evaluate the predictive accuracy of the model.

Exercise 1: Language identification

- Write a text classification pipeline using a custom preprocessor and CharNGramAnalyzer, with data from Wikipedia articles as the training set.
- Evaluate the performance on a held-out test set.

ipython command line:

%run workspace/exercise_01_language_train_model.py data/languages/paragraphs/

Exercise 2: Sentiment Analysis of Movie Reviews

- Write a text classification pipeline to classify movie reviews as either positive or negative.
- Find a good set of parameters using grid search.
- Evaluate the performance on a held-out test set.

ipython command line:

%run workspace/exercise_02_sentiment.py data/movie_reviews/txt_sentoken/

Exercise 3: CLI Text Classification Utility

- Using the results of the previous exercises and the cPickle module of the standard library, write a command-line utility that detects the language of text provided on stdin and estimates its polarity (positive or negative) if the text is written in English (a small persistence sketch follows this list).
- Bonus point if the utility is able to give a confidence level for its predictions.
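
The model-persistence part of this exercise boils down to dumping a fitted pipeline to disk and loading it back inside the utility. Here is a minimal sketch under a few assumptions: it uses Python 3's pickle (the exercise mentions the Python 2 cPickle module, which has the same dump/load interface), it reuses the text_clf pipeline trained earlier in this tutorial rather than the exercise's own models, and the filename is hypothetical:

>>> # Sketch: persist a fitted pipeline and reload it later.
>>> import pickle
>>> with open('text_clf.pkl', 'wb') as f:      # hypothetical filename
...     pickle.dump(text_clf, f)
...
>>> with open('text_clf.pkl', 'rb') as f:
...     loaded_clf = pickle.load(f)
...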

Where to from here

Here are a few suggestions to help you further your scikit-learn intuition after completing this tutorial:

- Try playing with the `analyzer` and `token normalization` of CountVectorizer.
- If you don't have labels, try using [Clustering](http://scikit-learn.org/0.18/auto_examples/text/document_clustering.html#sphx-glr-auto-examples-text-document-clustering-py) on your problem.
- If you have multiple labels per document (e.g. categories), see the [Multiclass and multilabel section](http://qiita.com/nazoking@github/items/9decf45d106accc6afe1).
- Try using Truncated SVD for Latent Semantic Analysis (a short sketch follows this list).
- Have a look at [Out-of-Core Classification](http://scikit-learn.org/0.18/auto_examples/applications/plot_out_of_core_classification.html#sphx-glr-auto-examples-applications-plot-out-of-core-classification-py) to learn from data that does not fit into the computer's main memory.
- Have a look at the [Hashing Vectorizer](http://qiita.com/nazoking@github/items/b270288fa38aed0a71bf#4239-hashingvectorizer%E3%81%A7%E3%82%A2%E3%82%A6%E3%83%88%E3%82%AA%E3%83%96%E3%82%B3%E3%82%A2%E3%82%B9%E3%82%B1%E3%83%BC%E3%83%AA%E3%83%B3%E3%82%B0%E3%82%92%E5%AE%9F%E8%A1%8C%E3%81%99%E3%82%8B) as a memory-efficient alternative to CountVectorizer.
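
As a sketch of the Truncated SVD suggestion (reusing the X_train_tfidf matrix computed earlier; the choice of 100 components is arbitrary):

>>> # Sketch: Latent Semantic Analysis by reducing the tf-idf matrix
>>> # to 100 components with Truncated SVD.
>>> from sklearn.decomposition import TruncatedSVD
>>> svd = TruncatedSVD(n_components=100, random_state=42)
>>> X_lsa = svd.fit_transform(X_train_tfidf)
>>> X_lsa.shape
(2257, 100)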



© 2010 - 2016, scikit-learn developers (BSD License).
