Morphological analysis and TF-IDF (with test code) in about one minute

Preparation

pip install nltk
pip install mecab-python

Paste the code below into a file and run it.

The function that computes TF-IDF is tfidf, and the function that performs morphological analysis is extract_words. The long block below import unittest near the bottom is the test code.

#!/usr/bin/env python
#-*- encoding: utf-8 -*-
import nltk
import MeCab
import urllib2
from urllib2 import HTTPError
from itertools import chain


def tfidf(doc,docs):
  """Given a target document and the full list of documents (each a list of morphologically analyzed words), return the TF-IDF of every word in the corpus with respect to the target document"""
  tokens = list(chain.from_iterable(docs)) #flatten
  A = nltk.TextCollection(docs)
  token_types = set(tokens)
  return [{"word":token_type,"tfidf":A.tf_idf(token_type, doc)} for token_type in token_types]
    

def extract_words(text):
  """Given text, returns a list of nouns"""
  text =  text.encode("utf-8") if isinstance(text,unicode) else text
  mecab = MeCab.Tagger("")
  node = mecab.parseToNode(text)
  words = []
  while node:
    fs = node.feature.split(",")
    # fs[0] is the part of speech; MeCab's IPADIC dictionary labels nouns as 名詞
    if node.surface and fs[0] == "名詞":
      words.append(node.surface)
    node = node.next
  return words

import unittest

class MachineLearningTest(unittest.TestCase):
  def test_extract_words(self):
    """Morphological analysis test"""
    # MeCab needs Japanese input: "Morphologically parse text and return a list of nouns"
    text = "テキストを形態素解析して、名詞のリストを返す"
    keywords = extract_words(text)
    self.assertEqual(keywords, ["テキスト","形態素","解析","名詞","リスト"])
  def test_tfidf(self):
    """tfidf test"""
    urls = ["http://qiita.com/puriketu99/items/"+str(i) for i in range(1,10)]
    def url2words(url):
      try:
        html = urllib2.urlopen(url).read()
      except HTTPError:
        html = ""
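      # nltk.clean_html works only in NLTK 2.x; in NLTK 3 it raises NotImplementedError and points to BeautifulSoup's get_text()
      plain_text = nltk.clean_html(html).replace('\n','')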
      words = extract_words(plain_text)
      return words
    docs = [url2words(url) for url in urls]
    tfidfs_fizzbuzz = tfidf(docs[0],docs)
    tfidfs_fizzbuzz.sort(key=lambda x: x["tfidf"], reverse=True)
    result = [e for i,e in enumerate(tfidfs_fizzbuzz) if len(e["word"]) > 2 and i < 30]
    self.assertEqual(result[7]["word"],"yaotti")  # may fail if Qiita changes its page design
    print result
    #[{'tfidf': 0.08270135278254376, 'word': 'quot'},
    # {'tfidf': 0.02819364299404901, 'word': 'FizzBuzz'},
    # {'tfidf': 0.02067533819563594, 'word': 'fizzbuzz'},
    # {'tfidf': 0.02067533819563594, 'word': 'Buzz'},
    # {'tfidf': 0.016916185796429405, 'word': 'Fizz'},
    # {'tfidf': 0.016726267030018446, 'word': 'end'},
    # {'tfidf': 0.015036609596826138, 'word': 'map'},
    # {'tfidf': 0.015036609596826138, 'word': 'yaotti'},
    # {'tfidf': 0.011277457197619604, 'word': 'def'}]

if __name__ == '__main__':
  unittest.main()
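
For readers on Python 3, the noun filter inside extract_words can be tried without installing MeCab. The following is only a sketch under stated assumptions, not the code above: nouns_from_mecab_output and SAMPLE_OUTPUT are hypothetical names, and the sample text is hand-written in the tab-and-comma layout that MeCab prints by default with the IPADIC dictionary, not captured from a real run.

```python
# Sketch (Python 3): pick nouns out of MeCab's default text output.
# SAMPLE_OUTPUT is hand-written in the IPADIC layout for illustration;
# it is an assumption, not the result of an actual MeCab run.
SAMPLE_OUTPUT = (
    "テキスト\t名詞,一般,*,*,*,*,テキスト,テキスト,テキスト\n"
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ\n"
    "解析\t名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ\n"
    "する\t動詞,自立,*,*,サ変・スル,基本形,する,スル,スル\n"
    "EOS\n"
)


def nouns_from_mecab_output(parsed):
    """Return the surface forms whose first feature field is 名詞 (noun)."""
    nouns = []
    for line in parsed.splitlines():
        if line == "EOS" or "\t" not in line:
            continue  # skip the end-of-sentence marker and blank lines
        surface, feature = line.split("\t", 1)
        if feature.split(",")[0] == "名詞":
            nouns.append(surface)
    return nouns


print(nouns_from_mecab_output(SAMPLE_OUTPUT))
```

With the mecab-python3 binding, the string returned by MeCab.Tagger().parse(text) should be usable in place of SAMPLE_OUTPUT, assuming the default output format and an IPADIC-style dictionary.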

Reference: Calculation of TF-IDF http://everydayprog.blogspot.jp/2011/12/tf-idf.html
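
To see what the score means without NLTK, the calculation can be sketched in plain Python 3. This assumes the common unsmoothed definition, tf = count / document length and idf = log(N / number of documents containing the word); NLTK's TextCollection appears to use the same form, but check its source before relying on identical numbers. The names tf, idf and tfidf_scores and the tiny corpus are made up for illustration.

```python
import math


def tf(term, doc):
    """Term frequency: share of the tokens in doc that equal term."""
    return doc.count(term) / len(doc) if doc else 0.0


def idf(term, docs):
    """Inverse document frequency: log(N / number of docs containing term)."""
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing) if containing else 0.0


def tfidf_scores(doc, docs):
    """Score every word in the corpus against doc, highest score first."""
    vocabulary = {token for d in docs for token in d}
    scores = [{"word": w, "tfidf": tf(w, doc) * idf(w, docs)} for w in vocabulary]
    return sorted(scores, key=lambda e: e["tfidf"], reverse=True)


# Hypothetical corpus: each document is already a list of words.
docs = [
    ["fizz", "buzz", "fizz"],
    ["buzz", "map"],
    ["map", "def"],
]
ranking = tfidf_scores(docs[0], docs)
print(ranking[0])  # "fizz" ranks first: frequent in docs[0], absent elsewhere
```

Words that never occur in the target document get a score of 0, which is why the listing above sorts the scores and looks only at the top entries.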
