Trying to extract high-frequency words with NLTK (Python)

While reading the official NLTK (Natural Language Toolkit) documentation, I tried extracting the words that appear most often in a text. For now I got as far as displaying the most frequent keywords in some sample data, in order from the top, so I am leaving the steps here as a memo.

Development environment

NLTK installation

As with other libraries, start by installing it with pip.

$ pip install nltk

Extract high-frequency words

The general flow is as follows: 1) download the data needed for word tokenization and part-of-speech tagging, 2) read the sample text and split it into words, 3) tag each word with its part of speech and keep only the nouns, and 4) display the three most frequently used words.

Download required features

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

After importing nltk, download the data used for word tokenization (punkt) and part-of-speech tagging (averaged_perceptron_tagger), both provided by the NLTK project. Once they have been downloaded in an environment, no further downloads are required. Running the download again just prints a message such as Package punkt is already up-to-date!.
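The download can also be made conditional, so that a script does not call the downloader on every run. The following is only a minimal sketch and not part of the original memo: ensure_resources is a hypothetical helper name, and the resource paths tokenizers/punkt and taggers/averaged_perceptron_tagger are assumed to correspond to the two packages above (names may differ between NLTK versions; the LookupError message names whatever is missing). The find and download parameters exist only so the logic can be tried without NLTK or a network connection.

```python
def ensure_resources(resources, find=None, download=None):
    """Download each NLTK package only if its resource is not found locally.

    resources: iterable of (resource_path, package_name) pairs.
    find / download: default to nltk.data.find and nltk.download.
    Returns the list of package names that had to be downloaded.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be tried without NLTK
        find = find or nltk.data.find
        download = download or nltk.download

    downloaded = []
    for path, package in resources:
        try:
            find(path)  # raises LookupError when the resource is missing
        except LookupError:
            download(package)
            downloaded.append(package)
    return downloaded


# Assumed resource paths for the two packages used in this memo.
REQUIRED = [
    ("tokenizers/punkt", "punkt"),
    ("taggers/averaged_perceptron_tagger", "averaged_perceptron_tagger"),
]
```

With NLTK installed, calling ensure_resources(REQUIRED) at the top of a script should fetch only what is missing.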

Read the sample text and split it into words

raw = open('sample.txt').read()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)

tokens_l = [w.lower() for w in tokens]

Prepare a long piece of English text, such as an essay, in advance (sample.txt). After reading it, split it into words with word_tokenize(). Then convert every token to lowercase so that the same word is counted as one word even when it appears with different capitalization.
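To see why the lowercase step matters, here is a small sketch that uses only the standard library. The token list is made up for illustration, and collections.Counter stands in for the counting that is done later.

```python
from collections import Counter

# Made-up tokens, in the shape word_tokenize might return for a short text.
tokens = ["Language", "is", "fun", ".", "The", "language", "of", "the", "web"]

# Without normalization, "Language" and "language" are counted separately.
raw_counts = Counter(tokens)

# Lowercasing first merges the two spellings into one entry.
tokens_l = [w.lower() for w in tokens]
lower_counts = Counter(tokens_l)

print(raw_counts["language"], lower_counts["language"])  # prints: 1 2
```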

Extract only nouns after getting part of speech

pos = nltk.pos_tag(tokens_l)
only_nn = [x for (x, y) in pos if y == 'NN']

freq = nltk.FreqDist(only_nn)

After each token is tagged with its part of speech, only the words tagged NN (noun) are kept, and FreqDist is used to compute the frequency distribution, that is, how many times each word occurs.
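Two details of the noun filter are easy to miss, so here is a small sketch of them. The (word, tag) pairs are written by hand in the shape that nltk.pos_tag returns, and collections.Counter stands in for nltk.FreqDist (which is a Counter subclass), so nothing needs to be downloaded to try it. Note that ('NN') without a trailing comma is just the string 'NN', so a test like y in ('NN') is a substring check rather than a tuple lookup. Also, NN covers only singular common nouns; plural and proper nouns carry the tags NNS, NNP and NNPS.

```python
from collections import Counter

# Hand-written (word, tag) pairs in the shape nltk.pos_tag returns;
# the words and tags here are invented for illustration.
pos = [
    ("cat", "NN"), ("sat", "VBD"), ("mats", "NNS"),
    ("cat", "NN"), ("dog", "NN"), ("dogs", "NNS"),
]

# ('NN') is just the string 'NN', so `tag in ('NN')` is a substring test.
assert ("NN") == "NN"
assert "N" in ("NN")        # a tag "N" would slip through
assert "N" not in ("NN",)   # a one-element tuple needs the trailing comma

# Singular common nouns only, as in the memo.
only_nn = [word for word, tag in pos if tag == "NN"]

# Variation: also accept plural and proper nouns (NNS, NNP, NNPS).
all_nouns = [word for word, tag in pos if tag.startswith("NN")]

freq = Counter(only_nn)     # stand-in for nltk.FreqDist(only_nn)
print(freq.most_common(1))  # prints: [('cat', 2)]
```

Whether plurals should be counted together with singulars depends on the goal; merging "dog" and "dogs" would need an extra step such as lemmatization, which is outside this memo.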

Show top 3

print(freq.most_common(3))

The result is displayed with most_common(), which returns the words together with their counts, ordered from the most frequent. Passing 3 gives the top three.
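As a rough illustration of what most_common() returns, the sketch below uses collections.Counter, which nltk.FreqDist inherits from, with an invented noun list in place of the nouns from sample.txt. The result is a list of (word, count) pairs, so it can be unpacked for nicer output.

```python
from collections import Counter

# Invented noun list standing in for only_nn.
only_nn = ["time", "data", "time", "word", "data", "time", "text"]

freq = Counter(only_nn)
top3 = freq.most_common(3)  # list of (word, count), highest count first

# Print one ranked line per word instead of the raw list.
for rank, (word, count) in enumerate(top3, start=1):
    print(f"{rank}. {word}: {count}")
```

Words with equal counts come out in the order they were first seen, so "word" is listed ahead of "text" here.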

