Click here for the previous posts:
You will become an engineer in 100 days - Day 66 - Programming - About natural language processing
You will become an engineer in 100 days - Day 63 - Programming - Probability 1
You will become an engineer in 100 days - Day 59 - Programming - Algorithms
You will become an engineer in 100 days - Day 53 - Git - About Git
You will become an engineer in 100 days - Day 42 - Cloud - About cloud services
You will become an engineer in 100 days - Day 36 - Database - About databases
You will become an engineer in 100 days - Day 24 - Python - Python language basics 1
You will become an engineer in 100 days - Day 18 - JavaScript - JavaScript basics 1
You will become an engineer in 100 days - Day 14 - CSS - CSS basics 1
You will become an engineer in 100 days - Day 6 - HTML - HTML basics 1
This time, the topic is TF-IDF.
TF-IDF
TF-IDF is an index that combines term frequency and inverse document frequency.
It is the product of `TF (Term Frequency)`, how often a word appears in a document,
and `IDF (Inverse Document Frequency)`, how rare that word is across documents.
reference: https://ja.wikipedia.org/wiki/Tf-idf
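To make the definition concrete, here is a minimal sketch of the raw, textbook formula. The helper name tfidf and its arguments are just for illustration; as shown later, scikit-learn's TfidfVectorizer adds smoothing and normalization on top of this, so its numbers differ slightly.

import math

def tfidf(term, doc_tokens, all_doc_tokens):
    """Plain TF-IDF: term frequency times inverse document frequency."""
    tf = doc_tokens.count(term)                           # how often the word appears in this document
    df = sum(1 for doc in all_doc_tokens if term in doc)  # how many documents contain the word
    idf = math.log(len(all_doc_tokens) / df)              # rarer words get a larger IDF
    return tf * idf

With this raw definition, a word that appears in every document gets an IDF of exactly 0. scikit-learn uses a smoothed IDF instead (shown later in this post), so the IDF never drops all the way to zero.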
First, let's count how often each word appears in a sentence.
We'll prepare some sentences to count. The text is Japanese, pre-split into tokens with spaces so that CountVectorizer can separate the words; rough English glosses are in the comments.
result_list = []
result_list.append('わたし は 猫 で ある')    # "I am a cat"
result_list.append('わたし は 猫 で ある')    # "I am a cat"
result_list.append('わたし も です')          # "Me too"
result_list.append('どうぞ どうぞ 猫 です')   # "Please, please, be a cat"
You can count the frequency of occurrence of words with the following code.
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# The default token_pattern drops one-character tokens;
# '(?u)\b\w+\b' keeps single-character words such as は and で.
count_vectorizer = CountVectorizer(token_pattern='(?u)\\b\\w+\\b')
count_vectorizer.fit(result_list)
X = count_vectorizer.transform(result_list)
print(len(count_vectorizer.vocabulary_))
print(count_vectorizer.vocabulary_)
# On scikit-learn 1.2 and later, use get_feature_names_out() instead.
pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names())
8 {'わたし': 6, 'は': 4, '猫': 7, 'で': 1, 'ある': 0, 'も': 5, 'です': 2, 'どうぞ': 3}
 | ある | で | です | どうぞ | は | も | わたし | 猫 |
---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 |
1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 |
2 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
3 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 1 |
You can count how many times a word appears in each sentence.
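If you want to double-check these counts without scikit-learn, a quick sketch with collections.Counter over the whitespace-split tokens gives the same numbers for the result_list defined above:

import collections

for i, sentence in enumerate(result_list):
    # Counter tallies each whitespace-separated token in the sentence
    print(i, collections.Counter(sentence.split()))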
Next, let's find TF-IDF.
You can find it with the following code.
from sklearn.feature_extraction.text import TfidfVectorizer
import warnings
import numpy as np
import pandas as pd
warnings.filterwarnings('ignore')

tfidf_vectorizer = TfidfVectorizer(token_pattern='(?u)\\b\\w+\\b')
tfidf_vectorizer.fit(result_list)
print(len(tfidf_vectorizer.vocabulary_))
print(tfidf_vectorizer.vocabulary_)
# fit_transform re-fits and returns the TF-IDF matrix in one step
# (the separate fit above is only needed to print the vocabulary first).
X = tfidf_vectorizer.fit_transform(result_list)
pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names())
8 {'わたし': 6, 'は': 4, '猫': 7, 'で': 1, 'ある': 0, 'も': 5, 'です': 2, 'どうぞ': 3}
 | ある | で | です | どうぞ | は | も | わたし | 猫 |
---|---|---|---|---|---|---|---|---|
0 | 0.481635 | 0.481635 | 0 | 0 | 0.481635 | 0 | 0.389925 | 0.389925 |
1 | 0.481635 | 0.481635 | 0 | 0 | 0.481635 | 0 | 0.389925 | 0.389925 |
2 | 0 | 0 | 0.553492 | 0 | 0 | 0.702035 | 0.4481 | 0 |
3 | 0 | 0 | 0.35157 | 0.891844 | 0 | 0 | 0 | 0.284626 |
TF-IDF values range between 0 and 1. A word that appears in many sentences gets a small value, while a word that appears often in only a few sentences is treated as important: the closer the value is to 1, the rarer and more characteristic the word.
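As a sanity check, the first row of the table can be reproduced by hand. With its default settings (smooth_idf=True, norm='l2'), TfidfVectorizer computes idf(t) = ln((1 + N) / (1 + df(t))) + 1 and then L2-normalizes each row. A minimal sketch for sentence 0:

import numpy as np

N = 4                                                   # number of sentences
df = {'ある': 2, 'で': 2, 'は': 2, 'わたし': 3, '猫': 3}  # document frequencies of the words in sentence 0
idf = {t: np.log((1 + N) / (1 + d)) + 1 for t, d in df.items()}
row = np.array([idf[t] for t in ['ある', 'で', 'は', 'わたし', '猫']])  # each word occurs once, so tf = 1
row = row / np.linalg.norm(row)                         # L2 normalization
print(row)                                              # roughly [0.4816 0.4816 0.4816 0.3899 0.3899]

The normalization is also why わたし and 猫 get different values in different rows even though their IDF is the same.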
What we saw here are methods for vectorizing sentences and for measuring the rarity of words. Once sentences are expressed as numbers like this, you can do all sorts of calculations with them, for example measuring how similar two sentences are (see the sketch below). These techniques come up all the time in machine learning, so it is worth at least remembering their names and what they do.
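For instance, once the sentences are TF-IDF vectors, you can compare them with cosine similarity. A small sketch reusing the matrix X computed above (cosine_similarity is part of scikit-learn):

from sklearn.metrics.pairwise import cosine_similarity

# Pairwise similarity between all sentences in the TF-IDF matrix X
print(cosine_similarity(X))
# Sentences 0 and 1 are identical, so their similarity is 1.0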
32 days until you become an engineer
Otsu py's HP: http://www.otupy.net/
Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw
Twitter: https://twitter.com/otupython