[RUBY] I tried morphological analysis with MeCab

This post walks through the steps needed to run morphological analysis from Ruby (using the mecab gem) on Ubuntu 18.04 (bionic).

First, install MeCab, its UTF-8 IPA dictionary, the development headers, and the mecab gem.

apt install mecab mecab-ipadic-utf8 libmecab-dev
gem install mecab

The following program prints the raw analysis result for the contents of sample.txt.

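require 'mecab'

# Create a tagger with the default dictionary and print the raw analysis result.
# File.read closes the file after reading, unlike open(...).read.
tagger = MeCab::Tagger.new
puts tagger.parse(File.read('sample.txt'))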

The next sample parses the output string and prints each word with the number of times it appears, in ascending order of count.

require 'mecab'

tagger = MeCab::Tagger.new
t = tagger.parse(File.read('sample.txt'))

# Count how often each surface form (the text before the tab) appears.
words = {}
t.split("\n").each do |l|
  w = l.split("\t")[0]
  c = words[w] || 0
  c += 1
  words[w] = c
end

# Print the words in ascending order of count.
words.sort {|a,b| a[1] <=> b[1]}.each do |v|
  puts v[0]+"\t"+v[1].to_s
end

This example ignores part of speech, so punctuation marks (commas, periods, and so on) are counted as words, as is the EOS line that MeCab prints at the end of its output. Depending on your purpose, you will probably need to filter the results.
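The filtering mentioned above could be sketched roughly as follows. This is only a minimal illustration, not the author's code: `count_words` and its `keep_pos` parameter are hypothetical names, and the sample text is a hand-written string in the default ipadic output layout (surface form, a tab, then comma-separated features whose first field is the part of speech) rather than the result of a real run. With a different dictionary or output format the field positions may differ.

```ruby
# Hand-written sample in MeCab's default ipadic output layout (an assumption,
# not captured from a real run): "surface<TAB>pos,subtype,...", ending with EOS.
SAMPLE_OUTPUT = <<~MECAB_OUT
  すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ
  も\t助詞,係助詞,*,*,*,*,も,モ,モ
  もも\t名詞,一般,*,*,*,*,もも,モモ,モモ
  も\t助詞,係助詞,*,*,*,*,も,モ,モ
  もも\t名詞,一般,*,*,*,*,もも,モモ,モモ
  の\t助詞,連体化,*,*,*,*,の,ノ,ノ
  うち\t名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
  。\t記号,句点,*,*,*,*,。,。,。
  EOS
MECAB_OUT

# Hypothetical helper: count surface forms, keeping only the listed parts of speech.
def count_words(mecab_output, keep_pos: ['名詞'])
  counts = Hash.new(0)
  mecab_output.each_line(chomp: true) do |line|
    next if line.empty? || line == 'EOS'      # skip the end-of-sentence marker
    surface, feature = line.split("\t", 2)
    next if feature.nil?                      # skip lines without a feature field
    pos = feature.split(',').first            # first feature = part of speech (ipadic)
    counts[surface] += 1 if keep_pos.include?(pos)
  end
  counts
end

counts = count_words(SAMPLE_OUTPUT)
counts.sort_by { |_, n| n }.each do |word, n|
  puts format('%4d %s', n, word)
end
```

In a real script, the string returned by `tagger.parse` would be passed in place of `SAMPLE_OUTPUT`. Keeping only nouns (名詞) is just one possible choice; the list passed as `keep_pos` can be adjusted to the purpose of the analysis.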

There is also a gem called natto, which may be worth using instead. If you want to try more specialized analysis methods easily, or visualize the results as graphs, the free software KH Coder may be useful (it also appears to use MeCab internally).

--Reference: I tried using mecab

Addendum (20.06.13): I improved the code following the advice given in the comment section. The Ruby that ships as standard with Ubuntu (bionic-beaver) is 2.5.1p57, which does not have Enumerable#tally (added in Ruby 2.7), so this version uses Hash.new(0) instead of tally.

require 'mecab'

tagger = MeCab::Tagger.new
t = tagger.parse(IO.read('sample.txt'))
words = Hash.new(0)
t.split("\n").each do |l|
  w = l.split("\t")[0]
  words[w] += 1
end

words.sort_by {|a| a[1]}.each do |w,f|
  puts "%4d %s" % [f,w]
end
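For comparison, on Ruby 2.7 or later the counting step could probably be written with `Enumerable#tally`. The following is only a sketch under that assumption: the `output` string is a hand-written stand-in for the result of `tagger.parse`, and the EOS marker line is dropped before counting, which the code above does not do.

```ruby
# Hand-written stand-in for tagger.parse output (default ipadic layout assumed).
output = "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ\n" \
         "も\t助詞,係助詞,*,*,*,*,も,モ,モ\n" \
         "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ\n" \
         "EOS\n"

# Take the surface form of every line except the EOS marker.
surfaces = output.each_line(chomp: true)
                 .reject { |line| line == 'EOS' }
                 .map { |line| line.split("\t").first }

# Enumerable#tally (Ruby 2.7+) builds the word => count hash in one call.
words = surfaces.tally

words.sort_by { |_, count| count }.each do |word, count|
  puts format('%4d %s', count, word)
end
```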
