While reading the official NLTK (Natural Language Toolkit) documentation, I tried extracting the words that are used most often in a document. As a first pass, I displayed the highest-frequency keywords from some sample data in descending order, so I am leaving the steps here as a memo.
As with other libraries, start by installing it with pip.
$ pip install nltk
The general flow is as follows: 1) download the tokenizer and the part-of-speech tagger, 2) read the sample text and split it into tokens, 3) tag each token with its part of speech and extract only the nouns, and 4) display the three most frequently used words.
nltk_test.py
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
After importing nltk, download the tokenizer and the part-of-speech tagger data from the official servers. Once they have been downloaded into the environment, no further downloads are required; if you run the download again, you just get a message like Package punkt is already up-to-date!
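As a side note, you can skip the re-download attempt entirely by first asking whether the resource is already installed locally. A minimal sketch, assuming the standard resource paths 'tokenizers/punkt' and 'taggers/averaged_perceptron_tagger' for these two packages:
import nltk

# Download each resource only if nltk.data.find() cannot locate it locally.
for path, package in [('tokenizers/punkt', 'punkt'),
                      ('taggers/averaged_perceptron_tagger', 'averaged_perceptron_tagger')]:
    try:
        nltk.data.find(path)   # raises LookupError when the resource is missing
    except LookupError:
        nltk.download(package)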
nltk_test.py
raw = open('sample.txt').read()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)   # wrap the tokens in an nltk.Text object (not used below, but handy for exploring)
tokens_l = [w.lower() for w in tokens]
Prepare a reasonably long English text in advance (sample.txt). After reading it in, split it into tokens with word_tokenize(). Then, so that uppercase and lowercase variants of the same word are recognized as the same word, convert all tokens to lowercase.
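As a quick check, here is what the tokenization and lowercasing produce on a short inline sentence (the sentence is just an illustrative stand-in for sample.txt):
import nltk

raw = "NLTK makes tokenization easy. Tokenization splits text into tokens."
tokens = nltk.word_tokenize(raw)   # punctuation becomes its own token
print(tokens[:6])
# something like: ['NLTK', 'makes', 'tokenization', 'easy', '.', 'Tokenization']

tokens_l = [w.lower() for w in tokens]   # 'Tokenization' and 'tokenization' now match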
nltk_test.py
pos = nltk.pos_tag(tokens_l)   # tag each token with its part of speech
only_nn = [x for (x, y) in pos if y == 'NN']
freq = nltk.FreqDist(only_nn)
Tag each token with pos_tag(), extract only the tokens whose tag is NN (singular noun), and compute the frequency distribution with FreqDist to count how often each noun appears.
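To see what pos_tag() actually returns, here is the filtering step run on a toy token list (the words are made up for illustration; the tags follow the Penn Treebank tag set, in which NN means singular noun):
import nltk

pos = nltk.pos_tag(['the', 'cat', 'sat', 'on', 'the', 'mat'])
# pos is a list of (word, tag) pairs, roughly:
# [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]

only_nn = [word for (word, tag) in pos if tag == 'NN']   # keep only singular nouns
freq = nltk.FreqDist(only_nn)   # FreqDist behaves like a Counter
print(freq['cat'])   # frequency of a single word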
nltk_test.py
print(freq.most_common(3))
Finally, most_common(), the standard FreqDist method that returns items sorted from most to least frequent, displays the top three.
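Putting the steps together, the whole nltk_test.py looks roughly like this (sample.txt is assumed to be any long English text you have on hand):
import nltk

# One-time resource downloads (no-ops after the first run)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Read the text and split it into lowercase tokens
raw = open('sample.txt').read()
tokens = nltk.word_tokenize(raw)
tokens_l = [w.lower() for w in tokens]

# Tag parts of speech and keep only singular nouns (tag 'NN')
pos = nltk.pos_tag(tokens_l)
only_nn = [word for (word, tag) in pos if tag == 'NN']

# Count occurrences and display the three most frequent nouns
freq = nltk.FreqDist(only_nn)
print(freq.most_common(3))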