While reading the official NLTK (Natural Language Toolkit) documentation, I tried extracting the words that are used most often in a document. As a first pass, I displayed the highest-frequency keywords from some sample data in descending order, so I am leaving the steps here as a memo.
As with other libraries, start by installing it with pip.
$ pip install nltk
The general flow is as follows: 1) download the tokenizer and the part-of-speech tagger, 2) read the sample text and split it into tokens, 3) tag each token with its part of speech and extract only the nouns, and 4) display the three most frequently used words.
nltk_test.py
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
After importing nltk, download the tokenizer and the part-of-speech tagger data from the official servers. Once they have been downloaded into the environment, no further downloads are required; if you run the download again, you just get a message like Package punkt is already up-to-date!
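As a side note, you can skip the re-download attempt entirely by first asking whether the resource is already installed locally. A minimal sketch, assuming the standard resource paths 'tokenizers/punkt' and 'taggers/averaged_perceptron_tagger' for these two packages:
import nltk

# Download each resource only if nltk.data.find() cannot locate it locally.
for path, package in [('tokenizers/punkt', 'punkt'),
                      ('taggers/averaged_perceptron_tagger', 'averaged_perceptron_tagger')]:
    try:
        nltk.data.find(path)   # raises LookupError when the resource is missing
    except LookupError:
        nltk.download(package)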
nltk_test.py
raw = open('sample.txt').read()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)   # wrap the tokens in an nltk.Text object (not used below, but handy for exploring)
tokens_l = [w.lower() for w in tokens]
Prepare a reasonably long English text in advance (sample.txt). After reading it in, split it into tokens with word_tokenize(). Then, so that uppercase and lowercase variants of the same word are recognized as the same word, convert all tokens to lowercase.
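As a quick check, here is what the tokenization and lowercasing produce on a short inline sentence (the sentence is just an illustrative stand-in for sample.txt):
import nltk

raw = "NLTK makes tokenization easy. Tokenization splits text into tokens."
tokens = nltk.word_tokenize(raw)   # punctuation becomes its own token
print(tokens[:6])
# something like: ['NLTK', 'makes', 'tokenization', 'easy', '.', 'Tokenization']

tokens_l = [w.lower() for w in tokens]   # 'Tokenization' and 'tokenization' now match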
nltk_test.py
pos = nltk.pos_tag(tokens_l)   # tag each token with its part of speech
only_nn = [x for (x, y) in pos if y == 'NN']
freq = nltk.FreqDist(only_nn)
Tag each token with pos_tag(), extract only the tokens whose tag is NN (singular noun), and compute the frequency distribution with FreqDist to count how often each noun appears.
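To see what pos_tag() actually returns, here is the filtering step run on a toy token list (the words are made up for illustration; the tags follow the Penn Treebank tag set, in which NN means singular noun):
import nltk

pos = nltk.pos_tag(['the', 'cat', 'sat', 'on', 'the', 'mat'])
# pos is a list of (word, tag) pairs, roughly:
# [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]

only_nn = [word for (word, tag) in pos if tag == 'NN']   # keep only singular nouns
freq = nltk.FreqDist(only_nn)   # FreqDist behaves like a Counter
print(freq['cat'])   # frequency of a single word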
nltk_test.py
print(freq.most_common(3))
Finally, most_common(), the standard FreqDist method that returns items sorted from most to least frequent, displays the top three.
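Putting the steps together, the whole nltk_test.py looks roughly like this (sample.txt is assumed to be any long English text you have on hand):
import nltk

# One-time resource downloads (no-ops after the first run)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Read the text and split it into lowercase tokens
raw = open('sample.txt').read()
tokens = nltk.word_tokenize(raw)
tokens_l = [w.lower() for w in tokens]

# Tag parts of speech and keep only singular nouns (tag 'NN')
pos = nltk.pos_tag(tokens_l)
only_nn = [word for (word, tag) in pos if tag == 'NN']

# Count occurrences and display the three most frequent nouns
freq = nltk.FreqDist(only_nn)
print(freq.most_common(3))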