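This post walks through the steps needed to run morphological analysis from Ruby (using the mecab gem) on Ubuntu (bionic).
First, install MeCab, a dictionary, the development headers, and the gem (run apt as root or with sudo).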
apt install mecab mecab-ipadic-utf8 libmecab-dev
gem install mecab
The following program prints the raw analysis result.
require 'mecab'
tagger = MeCab::Tagger.new
puts tagger.parse(File.read('sample.txt'))
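For reference, with the default output format and the IPA dictionary, each line of the result is a surface form, a tab, and a comma-separated feature list (part of speech first), and the result ends with an EOS line. Here is a minimal sketch of splitting that text into fields. It works on a hard-coded sample string so it runs without MeCab installed; `Morpheme`, `parse_mecab_output`, and `SAMPLE_OUTPUT` are names made up for this illustration, not part of the mecab gem.

```ruby
# Hypothetical helper names; the sample mimics MeCab's default output
# with the IPA dictionary: surface, TAB, comma-separated features.
Morpheme = Struct.new(:surface, :features)

def parse_mecab_output(output)
  lines = output.each_line(chomp: true).reject { |line| line.empty? || line == "EOS" }
  lines.map do |line|
    surface, feature = line.split("\t", 2)   # split only at the first tab
    Morpheme.new(surface, feature.to_s.split(","))
  end
end

SAMPLE_OUTPUT = <<~EOT
  すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ
  も\t助詞,係助詞,*,*,*,*,も,モ,モ
  もも\t名詞,一般,*,*,*,*,もも,モモ,モモ
  EOS
EOT

# Print each surface form with its part of speech (the first feature).
parse_mecab_output(SAMPLE_OUTPUT).each do |m|
  puts "#{m.surface}: #{m.features[0]}"
end
```

In a real script, the string returned by `tagger.parse` would take the place of `SAMPLE_OUTPUT`.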
The next sample parses the output string and prints the words in ascending order of how often they appear.
require 'mecab'
tagger = MeCab::Tagger.new
t = tagger.parse(File.read('sample.txt'))
words = {}
t.split("\n").each do |l|
  next if l == 'EOS'        # skip the end-of-sentence marker MeCab appends
  w = l.split("\t")[0]      # the surface form comes before the first tab
  c = words[w] || 0
  c += 1
  words[w] = c
end
# sort by count, ascending
words.sort {|a,b| a[1] <=> b[1]}.each do |v|
  puts v[0]+"\t"+v[1].to_s
end
This example ignores part of speech, so punctuation marks (commas, periods, and so on) are counted as well. Depending on your purpose, you will probably need to filter the results.
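One way to do that filtering is to look at the first feature of each line, which is the part of speech in the default IPA dictionary format. The sketch below assumes that format and keeps nouns (名詞) only. `count_by_pos`, its `keep:` parameter, and `POS_SAMPLE` are hypothetical names for this illustration, and the sample string stands in for real `tagger.parse` output so the code runs without MeCab.

```ruby
# Hypothetical helper: count surface forms, keeping only the given parts of speech.
# Assumes the first comma-separated feature is the part of speech (IPA dictionary).
def count_by_pos(output, keep: ["名詞"])
  counts = Hash.new(0)
  output.each_line(chomp: true) do |line|
    surface, feature = line.split("\t", 2)
    next if feature.nil?                 # "EOS" and blank lines have no tab
    pos = feature.split(",").first
    counts[surface] += 1 if keep.include?(pos)
  end
  counts
end

POS_SAMPLE = <<~EOT
  すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ
  も\t助詞,係助詞,*,*,*,*,も,モ,モ
  もも\t名詞,一般,*,*,*,*,もも,モモ,モモ
  も\t助詞,係助詞,*,*,*,*,も,モ,モ
  もも\t名詞,一般,*,*,*,*,もも,モモ,モモ
  、\t記号,読点,*,*,*,*,、,、,、
  EOS
EOT

# Particles (助詞) and symbols (記号) are dropped; only nouns are counted.
count_by_pos(POS_SAMPLE).each { |w, f| puts "%4d %s" % [f, w] }
```

Passing something like `keep: ["名詞", "動詞"]` would count verbs too; which parts of speech to keep depends on what you are analyzing.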
There is also a gem called natto, which may be worth using instead. If you want to try more specialized analysis methods easily, or to visualize the results as graphs, the free software KH Coder may be useful (it appears to use MeCab internally as well).
Reference: I tried using mecab
Addendum (20.06.13): I improved the code based on the advice given in the comments. The Ruby that ships with Ubuntu (bionic-beaver) is 2.5.1p57, which predates Enumerable#tally (added in Ruby 2.7), so this version does not use tally.
require 'mecab'
tagger = MeCab::Tagger.new
t = tagger.parse(IO.read('sample.txt'))
words = Hash.new(0)         # missing keys start at 0
t.split("\n").each do |l|
  next if l == 'EOS'        # skip the end-of-sentence marker
  w = l.split("\t")[0]
  words[w] += 1
end
words.sort_by {|_, f| f}.each do |w,f|
  puts "%4d %s" % [f,w]
end
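For comparison, on Ruby 2.7 or later the counting loop can be replaced with Enumerable#tally. The following is only a sketch under that version assumption. `tally_surfaces` and `TALLY_SAMPLE` are hypothetical names, and the sample string stands in for the output of `tagger.parse` so the code runs without MeCab.

```ruby
# Requires Ruby 2.7+ (Enumerable#tally). Hypothetical helper name.
def tally_surfaces(output)
  output.each_line(chomp: true)
        .select { |line| line.include?("\t") }     # drops "EOS" and blank lines
        .map { |line| line.split("\t", 2).first }  # keep the surface form only
        .tally
end

TALLY_SAMPLE = <<~EOT
  すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ
  も\t助詞,係助詞,*,*,*,*,も,モ,モ
  もも\t名詞,一般,*,*,*,*,もも,モモ,モモ
  も\t助詞,係助詞,*,*,*,*,も,モ,モ
  もも\t名詞,一般,*,*,*,*,もも,モモ,モモ
  EOS
EOT

# Print counts in ascending order, in the same "%4d %s" layout as above.
tally_surfaces(TALLY_SAMPLE).sort_by { |_, f| f }.each do |w, f|
  puts "%4d %s" % [f, w]
end
```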