Google translation of http://scikit-learn.org/0.18/modules/feature_extraction.html, part of the [scikit-learn 0.18 User Guide, 4. Dataset transformations](http://qiita.com/nazoking@github/items/267f2371757516f8c168).
The sklearn.feature_extraction module can be used to extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text and images.
The class DictVectorizer can be used to convert feature arrays represented as lists of standard Python dict objects to the NumPy / SciPy representation used by scikit-learn estimators. While not particularly fast to process, Python's dict has the advantages of being convenient to use, being sparse (absent features need not be stored) and storing feature names in addition to values. DictVectorizer implements what is called one-of-K or "one-hot" coding for categorical (also known as nominal or discrete) features. Categorical features are "attribute-value" pairs where the value is restricted to a list of discrete possibilities (topic identifiers, types of objects, tags, names, etc.). In the following, "city" is a categorical attribute while "temperature" is a traditional numerical feature:
>>> measurements = [
... {'city': 'Dubai', 'temperature': 33.},
... {'city': 'London', 'temperature': 12.},
... {'city': 'San Fransisco', 'temperature': 18.},
... ]
>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[ 1., 0., 0., 33.],
[ 0., 1., 0., 12.],
[ 0., 0., 1., 18.]])
>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
DictVectorizer is also a useful representation transformation for training sequence classifiers in natural language processing models, which typically work by extracting feature windows around a particular word of interest. For example, suppose we have a first algorithm that extracts part-of-speech (PoS) tags from a sentence. The following dict could be such a window of features extracted around the word "sat" in the sentence "The cat sat on the mat.":
>>> pos_window = [
... {
... 'word-2': 'the',
... 'pos-2': 'DT',
... 'word-1': 'cat',
... 'pos-1': 'NN',
... 'word+1': 'on',
... 'pos+1': 'PP',
... },
... #In a real application you will extract many such dicts
... ]
This description can be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (possibly after being piped into a [text.TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer) for normalization):
>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized
<1x6 sparse matrix of type '<... 'numpy.float64'>'
with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[ 1., 1., 1., 1., 1., 1.]])
>>> vec.get_feature_names()
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']
As you can imagine, extracting such a context around each individual word of a corpus of documents results in a very wide matrix (many one-hot features), most of whose values are zero most of the time. So that the resulting data structure can fit in memory, the DictVectorizer class uses a scipy.sparse matrix by default instead of a numpy.ndarray.
The class FeatureHasher is a fast, low-memory vectorizer that uses a technique known as feature hashing, or the "[hashing trick](https://en.wikipedia.org/wiki/Feature_hashing)". Instead of building a table that maps feature values to column indices, FeatureHasher applies a hash function to the features to determine their column index in the sample matrix directly. The result is increased speed and reduced memory usage, at the expense of inspectability: the hasher does not remember what the input features looked like and has no `inverse_transform` method. Since the hash function might cause collisions between (unrelated) features, a signed hash function is used, and the sign of the hash value determines the sign of the value stored in the output feature matrix. This way, collisions are likely to cancel out rather than accumulate error, and the expected mean of any output feature's value is zero.
If `non_negative=True` is passed to the constructor, the absolute value is taken. This undoes some of the collision handling, but allows the output to be passed to estimators that expect non-negative input, such as the [sklearn.naive_bayes.MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) estimator or the [sklearn.feature_selection.chi2](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2) feature selector.
FeatureHasher accepts either mappings (like Python's dict and its variants in the collections module), `(feature, value)` pairs, or lists of strings, depending on the constructor parameter `input_type`. Mappings are treated as lists of `(feature, value)` pairs, while single strings have an implicit value of 1, so `['feat1', 'feat2', 'feat3']` is interpreted as `[('feat1', 1), ('feat2', 1), ('feat3', 1)]`. If a single feature occurs multiple times in a sample, the associated values are summed (so `('feat', 2)` and `('feat', 3.5)` become `('feat', 5.5)`). The output from FeatureHasher is always a scipy.sparse matrix in the CSR format.
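For instance, here is a minimal sketch (not taken from the original documentation) of hashing dict inputs with the default input_type; repeated features within a single sample would have their values summed:
>>> from sklearn.feature_extraction import FeatureHasher
>>> h = FeatureHasher(n_features=8)          # a tiny n_features, only to keep the example compact
>>> X = h.transform([{'cat': 2, 'dog': 1},   # mappings are treated as (feature, value) pairs
...                  {'cat': 3.5, 'elephant': 4}])
>>> X.shape
(2, 8)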
Feature hashing can be employed in document classification, but unlike text.CountVectorizer, FeatureHasher does no word splitting or any other preprocessing except Unicode-to-UTF-8 encoding; see "Vectorizing a large text corpus with the hashing trick" below for a combined tokenizer / hasher.
As an example, consider a word-level natural language processing task that needs features extracted from (token, part_of_speech) pairs. One could use a Python generator function to extract features:
def token_features(token, part_of_speech):
if token.isdigit():
yield "numeric"
else:
yield "token={}".format(token.lower())
yield "token,pos={},{}".format(token, part_of_speech)
if token[0].isupper():
yield "uppercase_initial"
if token.isupper():
yield "all_uppercase"
yield "pos={}".format(part_of_speech)
Then, the raw_X to be fed to FeatureHasher.transform can be constructed using:
raw_X = (token_features(tok, pos_tagger(tok)) for tok in corpus)
and fed to the hasher with:
hasher = FeatureHasher(input_type='string')
X = hasher.transform(raw_X)
to get a scipy.sparse matrix X.
Note the use of a generator, which introduces laziness into the feature extraction: tokens are only processed on demand from the hasher.
FeatureHasher uses the signed 32-bit variant of MurmurHash3. As a result (and because of limitations in scipy.sparse), the maximum number of features currently supported is $2^{31} - 1$.
In the original hashing trick formulation by Weinberger et al., two separate hash functions $h$ and $\xi$ were used to determine the column index and the sign of a feature value, respectively. The current implementation works under the assumption that the sign bit of MurmurHash3 is independent of its other bits.
Since a simple modulo is used to transform the hash function output into a column index, it is advisable to use a power of two as the n_features parameter; otherwise the features will not be mapped evenly to the columns.
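As a purely conceptual sketch (this is not scikit-learn's actual implementation; Python's built-in hash stands in for MurmurHash3), a signed hashing trick maps a feature name to a column index and a sign roughly like this:
def hashed_index_and_sign(feature_name, n_features=2 ** 10):
    h = hash(feature_name)            # stand-in for the 32-bit MurmurHash3 value
    column = abs(h) % n_features      # modulo gives the column index, hence the power-of-two advice
    sign = 1 if h >= 0 else -1        # the sign decides whether the value is added or subtracted
    return column, sign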
References:
- Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola and Josh Attenberg (2009). Feature hashing for large scale multitask learning. Proc. ICML.
- MurmurHash3
Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors of fixed size rather than raw text documents of variable length. In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:
- tokenizing strings and giving an integer id to each possible token, for instance by using whitespace and punctuation as token separators.
- counting the occurrences of tokens in each document.
- normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.
In this scheme, features and samples are defined as follows:
- each individual token occurrence frequency (normalized or not) is treated as a feature.
- the vector of all the token frequencies for a given document is considered a multivariate sample.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus. Vectorization is the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences, completely ignoring the relative position information of the words in the document.
As most documents will typically use only a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
For instance, a collection of 10,000 short documents (such as emails) will use a vocabulary on the order of 100,000 unique words in total, while each individual document will use 100 to 1,000 unique words.
In order to store such matrices in memory and also to speed up algebraic operations on matrices / vectors, implementations typically use a sparse representation such as the ones available in the scipy.sparse package.
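As a small illustration (not part of the original documentation) of why this matters, a mostly-zero count matrix stored in CSR format only keeps its non-zero entries:
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> dense = np.zeros((3, 10), dtype=np.int64)   # 30 cells, almost all zero
>>> dense[0, 2] = 1
>>> dense[1, 5] = 3
>>> csr_matrix(dense).nnz                        # only the 2 non-zero values are stored
2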
CountVectorizer implements both tokenization and occurrence counting in a single class:
>>> from sklearn.feature_extraction.text import CountVectorizer
This model has many parameters, but the default values are quite reasonable (see the [reference documentation](http://scikit-learn.org/stable/modules/classes.html#text-feature-extraction-ref) for the details):
>>> vectorizer = CountVectorizer(min_df=1)
>>> vectorizer
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
Let's use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:
>>> corpus = [
... 'This is the first document.',
... 'This is the second second document.',
... 'And the third one.',
... 'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X
<4x9 sparse matrix of type '<... 'numpy.int64'>'
with 19 stored elements in Compressed Sparse ... format>
The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that performs this step can be requested explicitly:
>>> analyze = vectorizer.build_analyzer()
>>> analyze("This is a text document to analyze.") == (
... ['this', 'is', 'text', 'document', 'to', 'analyze'])
True
Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:
>>> vectorizer.get_feature_names() == (
... ['and', 'document', 'first', 'is', 'one',
... 'second', 'the', 'third', 'this'])
True
>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
[0, 1, 0, 1, 0, 2, 1, 0, 1],
[1, 0, 0, 0, 1, 0, 1, 1, 0],
[0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
The converse mapping from feature name to column index is stored in the vocabulary_ attribute of the vectorizer:
>>> vectorizer.vocabulary_.get('document')
1
Hence, words that were not seen in the training corpus will be completely ignored in future calls to the transform method:
>>> vectorizer.transform(['Something completely new.']).toarray()
...
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]]...)
Note that in the previous corpus, the first and the last documents have exactly the same words, hence they are encoded as equal vectors. In particular, we lose the information that the last document is an interrogative form. To preserve some of the local ordering information, we can extract 2-grams of words in addition to the 1-grams (individual words):
>>> bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
... token_pattern=r'\b\w+\b', min_df=1)
>>> analyze = bigram_vectorizer.build_analyzer()
>>> analyze('Bi-grams are cool!') == (
... ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool'])
True
The vocabulary extracted by this vectorizer is hence much bigger and can now resolve ambiguities encoded in local positioning patterns:
>>>
>>> X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
>>> X_2
...
array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
[0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]]...)
In particular, the interrogative form "Is this" is only present in the last document:
>>> feature_index = bigram_vectorizer.vocabulary_.get('is this')
>>> X_2[:, feature_index]
array([0, 0, 0, 1]...)
In a large text corpus, some words will be very frequent (e.g. "the", "a", "is" in English), hence carrying very little meaningful information about the actual contents of the document. If we were to feed the raw count data directly to a classifier, those very frequent terms would shadow the frequencies of rarer yet more interesting terms.
In order to re-weight the count features into values suitable for use by a classifier, it is very common to use the tf-idf transform.
Tf means term frequency while tf-idf means term frequency times inverse document frequency: $\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$.
Using the default settings of TfidfTransformer, `TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)`,
the term frequency, the number of times a term occurs in a given document, is multiplied with the idf component, which is computed as
\text{idf}(t) = log{\frac{1 + n_d}{1+\text{df}(d,t)}} + 1
where $n_d$ is the total number of documents and $\text{df}(d,t)$ is the number of documents that contain term $t$. The resulting tf-idf vectors are then normalized by the Euclidean norm:
v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}
This was originally a term weighting scheme developed for information retrieval (as a ranking function for search engine results) that has also found good use in document classification and clustering. The sections below explain exactly how the tf-idfs are computed and how the tf-idfs computed in scikit-learn's [TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer) and TfidfVectorizer differ slightly from the standard textbook notation, which defines the idf as
\text{idf}(t) = log{\frac{n_d}{1+\text{df}(d,t)}}
In the TfidfTransformer and TfidfVectorizer with smooth_idf=False, the "1" count is added to the idf instead of the idf's denominator:
\text{idf}(t) = log{\frac{n_d}{\text{df}(d,t)}} + 1
This normalization is implemented by the TfidfTransformer class.
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer(smooth_idf=False)
>>> transformer
TfidfTransformer(norm=...'l2', smooth_idf=False, sublinear_tf=False,
use_idf=True)
Again, please see the [reference documentation](http://scikit-learn.org/stable/modules/classes.html#text-feature-extraction-ref) for the details on all the parameters. Let's take the following counts as an example. The first term is present 100% of the time, hence not very interesting. The two other features occur in less than 50% of the documents, hence are probably more representative of the content of the documents:
>>> counts = [[3, 0, 1],
... [2, 0, 0],
... [3, 0, 0],
... [4, 0, 0],
... [3, 2, 0],
... [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<... 'numpy.float64'>'
with 9 stored elements in Compressed Sparse ... format>
>>> tfidf.toarray()
array([[ 0.81940995, 0. , 0.57320793],
[ 1. , 0. , 0. ],
[ 1. , 0. , 0. ],
[ 1. , 0. , 0. ],
[ 0.47330339, 0.88089948, 0. ],
[ 0.58149261, 0. , 0.81355169]])
Each row is normalized to have unit Euclidean norm:
v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}
For example, the tf-idf of the first term in the first document of the counts array can be computed as follows:
n_d = 6 \\
\text{df}(d, t)_{\text{term1}} = 6 \\
\text{idf}(d, t)_{\text{term1}} = log \frac{n_d}{\text{df}(d, t)} + 1 = log(1)+1 = 1 \\
\text{tf-idf}_{\text{term1}} = \text{tf} \times \text{idf} = 3 \times 1 = 3
Repeating this calculation for the remaining two terms in the document gives:
\text{tf-idf}_{\text{term2}} = 0 \times log(6/1)+1 = 0 \\
\text{tf-idf}_{\text{term3}} = 1 \times log(6/2)+1 \approx 2.0986
Raw tf-idfs vector:
\text{tf-idf}_{\text{raw}} = [3, 0, 2.0986]
Then, applying the Euclidean (L2) norm, we obtain the following tf-idfs for document 1:
\frac{[3, 0, 2.0986]}{\sqrt{\big(3^2 + 0^2 + 2.0986^2\big)}}
= [ 0.819, 0, 0.573]
Furthermore, the default parameter smooth_idf=True adds "1" to the numerator and denominator, as if an extra document containing every term in the collection exactly once had been seen, which prevents zero divisions:
\text{idf}(t) = log{\frac{1 + n_d}{1+\text{df}(d,t)}} + 1
Using this modification, the tf-idf of the third term in document 1 changes to 1.8473:
\text{tf-idf}_{\text{term3}} = 1 \times log(7/3)+1 \approx 1.8473
And the L2-normalized tf-idf changes to:
\frac{[3, 0, 1.8473]}{\sqrt{\big(3^2 + 0^2 + 1.8473^2\big)}} = [0.8515, 0, 0.5243]
>>> transformer = TfidfTransformer()
>>> transformer.fit_transform(counts).toarray()
array([[ 0.85151335, 0. , 0.52433293],
[ 1. , 0. , 0. ],
[ 1. , 0. , 0. ],
[ 1. , 0. , 0. ],
[ 0.55422893, 0.83236428, 0. ],
[ 0.63035731, 0. , 0.77630514]])
The weights of each feature computed by the fit method call are stored in a model attribute:
>>> transformer.idf_
array([ 1. ..., 2.25..., 1.84...])
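As a quick sanity check (a sketch, not part of the original documentation), these stored idf_ values can be reproduced by applying the smoothed formula above to the counts array, whose six documents have document frequencies of 6, 1 and 2 for the three terms:
>>> import numpy as np
>>> n_d = 6                               # number of documents in `counts`
>>> df = np.array([6, 1, 2])              # document frequency of each term
>>> np.log((1. + n_d) / (1. + df)) + 1    # matches transformer.idf_ above
array([ 1. ..., 2.25..., 1.84...])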
As tf-idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer(min_df=1)
>>> vectorizer.fit_transform(corpus)
...
<4x9 sparse matrix of type '<... 'numpy.float64'>'
with 19 stored elements in Compressed Sparse ... format>
While tf-idf normalization is often very useful, there might be cases where binary occurrence markers provide better features. This can be achieved by using the binary parameter of CountVectorizer. In particular, some estimators such as Bernoulli Naive Bayes explicitly model discrete boolean random variables. Also, very short texts are likely to have noisy tf-idf values, while the binary occurrence information is more stable.
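For instance, a small sketch (with two made-up toy documents) of binary occurrence features: with binary=True, CountVectorizer records presence or absence instead of counts:
>>> binary_vectorizer = CountVectorizer(binary=True, min_df=1)
>>> binary_vectorizer.fit_transform(['the cat sat on the mat', 'the the the']).toarray()
array([[1, 1, 1, 1, 1],
       [0, 0, 0, 0, 1]]...)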
As usual, the best way to adjust the feature extraction parameters is to use a cross-validated grid search, for instance by pipelining the feature extractor with a classifier (a minimal sketch follows the example link below):
-[Sample Pipeline for Text Feature Extraction and Evaluation](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid- search-text-feature-extraction-py)
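The sketch below is not the linked example itself; the documents and labels variables are placeholders for your training data. It only shows the general pattern of searching over vectorizer and transformer parameters together with a classifier:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],   # unigrams vs. unigrams + bigrams
    'tfidf__use_idf': (True, False),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1)
# grid_search.fit(documents, labels)         # placeholders for your corpus and its labels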
Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding. To work with text files in Python, their bytes must be decoded to a character set called Unicode. Common encodings are ASCII, Latin-1 (Western Europe), KOI8-R (Russian) and the universal encodings UTF-8 and UTF-16. Many others exist.
The text feature extractors in scikit-learn know how to decode text files, but only if you tell them what encoding the files are in. The CountVectorizer takes an `encoding` parameter for this purpose. For modern text files, the correct encoding is probably UTF-8, which is therefore the default (`encoding="utf-8"`).
If the text you are loading is not actually encoded with UTF-8, however, you will get a `UnicodeDecodeError`. The vectorizers can be told to be silent about decoding errors by setting the `decode_error` parameter to either `"ignore"` or `"replace"`. See the documentation for the Python function `bytes.decode` for more details (type `help(bytes.decode)` at the Python prompt).
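For example, a tiny sketch of tolerating badly encoded input (the byte string below is deliberately Latin-1, not valid UTF-8):
>>> lenient = CountVectorizer(decode_error='replace')
>>> lenient.fit_transform([b"caf\xe9 latte"]).shape   # would raise UnicodeDecodeError with the default 'strict'
(1, 2)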
If you are having trouble decoding text, here are some things to try:
- Find out what the actual encoding of the text is. The file might come with a header or README that tells you the encoding, or there might be a standard encoding you can assume based on where the text comes from.
- You may be able to find out what kind of encoding it is in general using the UNIX command `file`. The Python `chardet` module comes with a script called `chardetect.py` that will guess the specific encoding, though you cannot rely on its guess being correct.
- You could try UTF-8 and disregard the errors. You can decode byte strings with `bytes.decode(errors='replace')` to replace all decoding errors with a meaningless character, or set `decode_error='replace'` in the vectorizer. This may damage the usefulness of your features.
- Real text may come from a variety of sources that may have used different encodings, or even be sloppily decoded in a different encoding than the one it was encoded with. This is common in text retrieved from the Web. The Python package `ftfy` can automatically sort out some classes of decoding errors, so you could try decoding the unknown text as latin-1 and then using `ftfy` to fix the errors.
- If the text is in a mish-mash of encodings that is simply too hard to sort out (which is the case for the 20 Newsgroups dataset), you can fall back on a simple single-byte encoding such as latin-1. Some text may display incorrectly, but at least the same sequence of bytes will always represent the same feature.
For example, the following snippet uses `chardet` (not shipped with scikit-learn; it must be installed separately) to figure out the encoding of three texts. It then vectorizes the texts and prints the learned vocabulary. The output is not shown here.
>>> import chardet
>>> text1 = b"Sei mir gegr\xc3\xbc\xc3\x9ft mein Sauerkraut"
>>> text2 = b"holdselig sind deine Ger\xfcche"
>>> text3 = b"\xff\xfeA\x00u\x00f\x00 \x00F\x00l\x00\xfc\x00g\x00e\x00l\x00n\x00 \x00d\x00e\x00s\x00 \x00G\x00e\x00s\x00a\x00n\x00g\x00e\x00s\x00,\x00 \x00H\x00e\x00r\x00z\x00l\x00i\x00e\x00b\x00c\x00h\x00e\x00n\x00,\x00 \x00t\x00r\x00a\x00g\x00 \x00i\x00c\x00h\x00 \x00d\x00i\x00c\x00h\x00 \x00f\x00o\x00r\x00t\x00"
>>> decoded = [x.decode(chardet.detect(x)['encoding'])
... for x in (text1, text2, text3)]
>>> v = CountVectorizer().fit(decoded).vocabulary_
>>> for term in v: print(term)
(Depending on the version of `chardet`, it might get the first one wrong.)
For an introduction to Unicode and character encodings in general, see Joel Spolsky's "The Absolute Minimum Every Software Developer Must Know About Unicode".
The Bag of Words representation is quite simplistic, but surprisingly useful in practice. In particular, in a **supervised setting** it can be successfully combined with fast and scalable linear models to train **document classifiers**, for instance:
-[Classification of text documents using sparse features](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#sphx-glr-auto-examples-text-document-classification-20newsgroups- py)
In an **unsupervised setting**, similar documents can be grouped together by applying clustering algorithms such as K-means:
-Clustering text documents using k-means
Finally, it is possible to discover the main topics of a corpus by relaxing the hard assignment constraint of clustering, for instance by using [Non-negative matrix factorization (NMF or NNMF)](http://scikit-learn.org/stable/modules/decomposition.html#nmf):
-[Topic Extraction Using Non-Negative Matrix Decomposition and Latent Dirichlet Allocation](http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-topics-extraction -with-nmf-lda-py)
A collection of unigrams (which is what the bag of words is) cannot capture phrases and multi-word expressions, effectively disregarding any word order dependence. Additionally, the bag of words model does not account for potential misspellings or word derivations.
N-grams to the rescue! Instead of building a simple collection of unigrams (n=1), one might prefer a collection of bigrams (n=2), where occurrences of pairs of consecutive words are counted.
Alternatively, one might consider a collection of character n-grams, a representation resilient against misspellings and derivations.
For example, let's say we're dealing with a corpus of two documents: ['words', 'wprds']. The second document contains a misspelling of the word "words". A simple bag of words representation would consider these two as very distinct documents, differing in both of the two possible features. A character 2-gram representation, however, would find the documents matching in 4 out of 8 features, which may help the preferred classifier decide better:
>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2), min_df=1)
>>> counts = ngram_vectorizer.fit_transform(['words', 'wprds'])
>>> ngram_vectorizer.get_feature_names() == (
... [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp'])
True
>>> counts.toarray().astype(int)
array([[1, 1, 1, 0, 1, 1, 1, 0],
[1, 1, 0, 1, 1, 1, 0, 1]])
In the above example, the char_wb analyzer is used, which creates n-grams only from characters inside word boundaries (padded with a space on each side). The char analyzer, alternatively, creates n-grams that span across words:
>>>
>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(5, 5), min_df=1)
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
...
<1x4 sparse matrix of type '<... 'numpy.int64'>'
with 4 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
... [' fox ', ' jump', 'jumpy', 'umpy '])
True
>>> ngram_vectorizer = CountVectorizer(analyzer='char', ngram_range=(5, 5), min_df=1)
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
...
<1x5 sparse matrix of type '<... 'numpy.int64'>'
with 5 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
... ['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox'])
True
The word-boundary-aware variant char_wb is especially interesting for languages that use white-spaces for word separation. For such languages it can increase both the predictive accuracy and convergence speed of classifiers trained with such features, while retaining robustness with regard to misspellings and word derivations.
While some local positioning information can be preserved by extracting n-grams instead of individual words, the bag of words and bag of n-grams representations destroy most of the inner structure of the document and hence most of the meaning carried by that internal structure.
To tackle the wider task of natural language understanding, the local structure of sentences and paragraphs must be taken into account. Many such models are cast as "structured output" problems, which are currently outside the scope of scikit-learn.
The vectorization scheme above is simple, but the fact that it holds an in-memory mapping from string tokens to integer feature indices (the vocabulary_ attribute) causes several problems when dealing with large datasets:
- the larger the corpus, the larger the vocabulary will grow, and hence the memory use too.
- fitting requires the allocation of intermediate data structures whose size is proportional to that of the original dataset.
- building the word mapping requires a full pass over the dataset, so it is not possible to fit text classifiers in a strictly online manner.
- serializing / deserializing vectorizers with a large vocabulary is very slow (usually much slower than serializing / deserializing flat data structures such as a NumPy array of the same size).
- it is not easy to split the vectorization work into parallel subtasks, because the vocabulary_ attribute would have to be a shared state with a fine-grained synchronization barrier: the mapping from token string to feature index depends on the order of first occurrence of each token and thus would have to be shared, potentially harming the concurrent workers' performance to the point of making them slower than the sequential variant.
It is possible to overcome these limitations by combining the "hashing trick" ([Feature hashing](http://scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing)) implemented by the sklearn.feature_extraction.FeatureHasher class with the text preprocessing and tokenization features of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer). This combination is implemented in HashingVectorizer, a transformer class that is mostly API-compatible with CountVectorizer. HashingVectorizer is stateless, meaning that you do not have to call fit on it:
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> hv = HashingVectorizer(n_features=10)
>>> hv.transform(corpus)
...
<4x10 sparse matrix of type '<... 'numpy.float64'>'
with 16 stored elements in Compressed Sparse ... format>
You can see that 16 non-zero feature tokens were extracted in the vector output: this is fewer than the 19 non-zeros extracted previously by the CountVectorizer on the same toy corpus. The discrepancy comes from hash function collisions due to the low value of the n_features parameter.
In a real-world setting, the n_features parameter can be left at its default value of 2 ** 20 (roughly one million possible features). If memory or downstream model size is an issue, selecting a lower value such as 2 ** 18 might help without introducing too many additional collisions on typical text classification tasks.
Note that the dimensionality does not affect the CPU training time of algorithms that operate on CSR matrices (LinearSVC(dual=True), Perceptron, SGDClassifier, PassiveAggressive), but it does affect algorithms that work with CSC matrices (LinearSVC(dual=False), Lasso(), etc.).
Try again with the default settings:
>>> hv = HashingVectorizer()
>>> hv.transform(corpus)
...
<4x1048576 sparse matrix of type '<... 'numpy.float64'>'
with 19 stored elements in Compressed Sparse ... format>
>>>
We no longer get the collisions, but this comes at the expense of a much larger dimensionality of the output space. Of course, terms other than the 19 used here might still collide with each other. The HashingVectorizer also comes with the following limitations:
- it is not possible to invert the model (there is no `inverse_transform` method) or to access the original string representation of the features, because of the one-way nature of the hash function that performs the mapping.
- it does not provide IDF weighting, as that would introduce statefulness into the model. A TfidfTransformer can be appended to it in a pipeline if required (see the sketch below).
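For instance, a minimal sketch of appending a TfidfTransformer to the stateless HashingVectorizer in a pipeline, reusing the toy corpus from above:
>>> from sklearn.pipeline import make_pipeline
>>> hashing_tfidf = make_pipeline(
...     HashingVectorizer(n_features=2 ** 18, non_negative=True),  # non-negative counts for the idf step
...     TfidfTransformer())
>>> hashing_tfidf.fit_transform(corpus).shape
(4, 262144)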
An interesting development of using a HashingVectorizer is the ability to perform out-of-core scaling. This means that we can learn from data that does not fit into the computer's main memory. A strategy to implement out-of-core scaling is to stream data to the estimator in mini-batches. Each mini-batch is vectorized using the HashingVectorizer so as to guarantee that the input space of the estimator always has the same dimensionality. The amount of memory used at any time is thus bounded by the size of a mini-batch. Although there is no limit to the amount of data that can be ingested using such an approach, from a practical point of view the training time is often limited by the CPU time one wants to spend on the task. For a full-fledged example of out-of-core scaling in a text classification task, see [Out-of-core classification of text documents](http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html#sphx-glr-auto-examples-applications-plot-out-of-core-classification-py).
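A schematic sketch of that mini-batch strategy follows; the iter_minibatches helper and the all_classes list are placeholders, not scikit-learn API, and any classifier supporting partial_fit (such as SGDClassifier) could be used:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer()                  # stateless: every batch maps into the same 2 ** 20 columns
classifier = SGDClassifier()
for text_batch, y_batch in iter_minibatches():    # placeholder generator yielding (texts, labels) mini-batches
    X_batch = vectorizer.transform(text_batch)    # no fit needed; memory use is bounded by the batch size
    classifier.partial_fit(X_batch, y_batch, classes=all_classes)  # all_classes: full label list, known up front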
It is possible to customize the behavior by passing a callable to the vectorizer constructor:
>>> def my_tokenizer(s):
... return s.split()
...
>>> vectorizer = CountVectorizer(tokenizer=my_tokenizer)
>>> vectorizer.build_analyzer()(u"Some... punctuation!") == (
... ['some...', 'punctuation!'])
True
In particular we name:
- `preprocessor`: a callable that takes an entire document as input (as a single string) and returns a possibly transformed version of the document, still as an entire string. This can be used to remove HTML tags, lowercase the entire document, etc. (a short sketch follows the note below).
- `tokenizer`: a callable that takes the output from the preprocessor, splits it into tokens and returns a list of these.
- `analyzer`: a callable that replaces the preprocessor and tokenizer. The default analyzers all call the preprocessor and tokenizer, but custom analyzers will skip this. N-gram extraction and stop word filtering take place at the analyzer level, so a custom analyzer may have to reproduce these steps.
(Lucene users might recognize these names, but be aware that scikit-learn concepts may not map one-to-one onto Lucene concepts.)
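For example, a brief sketch of a custom preprocessor (the tag-stripping regex is illustrative only; supplying a custom preprocessor takes the place of the default preprocessing, including lowercasing, so that is done explicitly here):
>>> import re
>>> def strip_tags_and_lower(doc):
...     return re.sub(r'<[^>]+>', ' ', doc).lower()   # drop anything that looks like a markup tag
...
>>> vect = CountVectorizer(preprocessor=strip_tags_and_lower)
>>> vect.build_analyzer()("<b>Some</b> tagged text")
['some', 'tagged', 'text']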
To make the preprocessor, tokenizer and analyzer aware of the model parameters, it is possible to derive from the class and override the build_preprocessor, build_tokenizer and build_analyzer factory methods instead of passing custom functions.
Some tips and tricks:
- If documents are pre-tokenized by an external package, store them in files (or strings) with the tokens separated by whitespace and pass `analyzer=str.split`.
- Fancy token-level analysis such as stemming, lemmatizing, compound splitting, filtering based on part of speech, etc. is not included in the scikit-learn codebase, but can be added by customizing either the tokenizer or the analyzer. Here is a CountVectorizer with a tokenizer and lemmatizer using NLTK:
>>> from nltk import word_tokenize
>>> from nltk.stem import WordNetLemmatizer
>>> class LemmaTokenizer(object):
... def __init__(self):
... self.wnl = WordNetLemmatizer()
... def __call__(self, doc):
... return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
...
>>> vect = CountVectorizer(tokenizer=LemmaTokenizer())
(Note that this does not exclude punctuation).
Vectorizer customization is also useful when dealing with Asian languages that do not use explicit word separators such as whitespace.
The extract_patches_2d function extracts patches from an image stored as a two-dimensional array, or three-dimensional with color information along the third axis. To rebuild an image from all its patches, use reconstruct_from_patches_2d. For example, let us generate a 4x4 pixel picture with 3 color channels (e.g. in RGB format):
>>> import numpy as np
>>> from sklearn.feature_extraction import image
>>> one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
>>> one_image[:, :, 0] # R channel of a fake RGB picture
array([[ 0, 3, 6, 9],
[12, 15, 18, 21],
[24, 27, 30, 33],
[36, 39, 42, 45]])
>>> patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2,
... random_state=0)
>>> patches.shape
(2, 2, 2, 3)
>>> patches[:, :, :, 0]
array([[[ 0, 3],
[12, 15]],
[[15, 18],
[27, 30]]])
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> patches.shape
(9, 2, 2, 3)
>>> patches[4, :, :, 0]
array([[15, 18],
[27, 30]])
Let us now try to reconstruct the original image from the patches by averaging over the overlapping areas:
>>> reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3))
>>> np.testing.assert_array_equal(one_image, reconstructed)
The PatchExtractor class works in the same way as extract_patches_2d, except that it supports multiple images as input. It is implemented as an estimator, so it can be used in pipelines:
>>> five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)
>>> patches = image.PatchExtractor((2, 2)).transform(five_images)
>>> patches.shape
(45, 2, 2, 3)
Several estimators in scikit-learn can use connectivity information between features or samples. For instance, Ward clustering ([Hierarchical clustering](http://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering)) can cluster together only neighboring pixels of an image, thus forming contiguous patches.
For this purpose, the estimators use a "connectivity" matrix, giving which samples are connected. The function img_to_graph returns such a matrix from a 2D or 3D image. Similarly, grid_to_graph builds a connectivity matrix for images given their shape. These matrices can be used to impose connectivity in estimators that use connectivity information, such as Ward clustering (hierarchical clustering), but also to build precomputed kernels or similarity matrices.
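A short sketch (not part of the original documentation) of building such a connectivity matrix with grid_to_graph and handing it to a Ward-linkage AgglomerativeClustering over the pixels of a toy image:
>>> from sklearn.feature_extraction.image import grid_to_graph
>>> from sklearn.cluster import AgglomerativeClustering
>>> toy_image = np.arange(4 * 4, dtype=float).reshape(4, 4)
>>> connectivity = grid_to_graph(*toy_image.shape)       # sparse matrix of neighboring-pixel links
>>> ward = AgglomerativeClustering(n_clusters=2, linkage='ward',
...                                connectivity=connectivity)
>>> ward.fit_predict(toy_image.reshape(-1, 1)).shape     # one sample (row) per pixel
(16,)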
Examples:
- [A demo of structured Ward hierarchical clustering on a face image](http://scikit-learn.org/stable/auto_examples/cluster/plot_face_ward_segmentation.html#sphx-glr-auto-examples-cluster-plot-face-ward-segmentation-py)
- Spectral clustering for image segmentation
- [Feature agglomeration vs. univariate selection](http://scikit-learn.org/stable/auto_examples/cluster/plot_feature_agglomeration_vs_univariate_selection.html#sphx-glr-auto-examples-cluster-plot-feature-agglomeration-vs-univariate-selection-py)
© 2010 - 2016, scikit-learn developers (BSD license).