This article explains how to use JCLdic, a Japanese company name dictionary.
JCLdic contains over 8 million company names and their aliases. It was created to address two problems with conventional dictionaries: their coverage of company names is low, and company names are hard to recognize because of spelling variations.
Download the MeCab dictionary. This article uses JCL_slim as an example.
Please install MeCab and mecab-python3 first.
Move the downloaded jcl_slim_mecab.dic to the user dictionary folder.
$ mkdir /usr/local/lib/mecab/dic/user_dict
$ mv jcl_slim_mecab.dic /usr/local/lib/mecab/dic/user_dict
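If the setup is scripted in Python, the same two steps could be sketched as follows. The function name install_user_dic is hypothetical, and the paths in the usage note simply mirror the ones in the commands above; adjust them to wherever MeCab is installed.

```python
import shutil
from pathlib import Path


def install_user_dic(dic_file, user_dict_dir):
    """Move a downloaded dictionary file into the MeCab user dictionary folder.

    Hypothetical helper mirroring the mkdir and mv commands.
    """
    target_dir = Path(user_dict_dir)
    # Like `mkdir -p`: create the folder, do nothing if it already exists
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / Path(dic_file).name
    shutil.move(str(dic_file), str(destination))
    return destination
```

For example, install_user_dic("jcl_slim_mecab.dic", "/usr/local/lib/mecab/dic/user_dict") would return the new location of the dictionary file. Writing under /usr/local may require elevated permissions.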
Update the MeCab configuration file mecabrc and write the dictionary path.
$ vim /usr/local/etc/mecabrc
In mecabrc, dicdir is the system dictionary path and userdic is the user dictionary path. Write the JCLdic path in userdic.
dicdir = /usr/local/lib/mecab/dic/ipadic
;dicdir = /usr/local/lib/mecab/dic/mecab-ipadic-neologd
;dicdir = /usr/local/lib/mecab/dic/jumandic
;dicdir = /usr/local/lib/mecab/dic/unidic
userdic = /usr/local/lib/mecab/dic/user_dict/jcl_slim_mecab.dic
; output-format-type = wakati
; input-buffer-size = 8192
; node-format = %m\n
; bos-format = %S\n
; eos-format = EOS\n
You can also specify paths for multiple user dictionaries.
userdic = /usr/local/lib/mecab/dic/user_dict/jcl_full_mecab_1.dic,/usr/local/lib/mecab/dic/user_dict/jcl_full_mecab_2.dic
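To check which dictionaries a given mecabrc will load, the file can be read with a few lines of Python. This is a minimal sketch, not MeCab's own parser: read_mecabrc and user_dictionaries are hypothetical helpers, and they assume the simple `key = value` layout shown above, with lines starting with `;` or `#` treated as comments.

```python
def read_mecabrc(text):
    """Parse mecabrc-style text into a dict (hypothetical helper).

    Assumes `key = value` lines; lines starting with ';' or '#' are skipped.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith((";", "#")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def user_dictionaries(settings):
    """Return the comma-separated userdic entry as a list of paths."""
    entry = settings.get("userdic", "")
    return [p.strip() for p in entry.split(",") if p.strip()]


sample = """\
dicdir = /usr/local/lib/mecab/dic/ipadic
;dicdir = /usr/local/lib/mecab/dic/unidic
userdic = /usr/local/lib/mecab/dic/user_dict/jcl_full_mecab_1.dic,/usr/local/lib/mecab/dic/user_dict/jcl_full_mecab_2.dic
"""
settings = read_mecabrc(sample)
print(settings["dicdir"])
for path in user_dictionaries(settings):
    print(path)
```

Listing the paths this way makes it easy to spot a typo in userdic before MeCab fails to load the dictionary.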
Now you're ready to go.
Result without jcl_slim_mecab.dic:
$ echo "TIS Co., Ltd. of the TIS INTEC Group has released JCLdic (Japanese company name dictionary), a dictionary for recognizing company names in natural language processing, free of charge." | mecab
TIS noun,General,*,*,*,*,*
Intec noun,Proper noun,Organization,*,*,*,INTEC,INTEC,INTEC
Group noun,General,*,*,*,*,group,group,group
of particle,Attributive,*,*,*,*,of,no,no
TIS noun,General,*,*,*,*,*
Co., Ltd. noun,General,*,*,*,*,Co., Ltd.,Kabushiki Gaisha,Kabushiki Gaisha
wa particle,Binding particle,*,*,*,*,wa,ha,wa
, Symbol,Comma,*,*,*,*,、,、,、
......
EOS
Result with jcl_slim_mecab.dic:
$ echo "TIS Co., Ltd. of the TIS INTEC Group has released JCLdic (Japanese company name dictionary), a dictionary for recognizing company names in natural language processing, free of charge." | mecab
TIS noun,Proper noun,Organization,*,*,*,TIS Co., Ltd.,*,*
Intec noun,Proper noun,Organization,*,*,*,INTEC Inc.,*,*
Group noun,General,*,*,*,*,group,group,group
of particle,Attributive,*,*,*,*,of,no,no
TIS Co., Ltd. noun,Proper noun,Organization,*,*,*,TIS Co., Ltd.,*,*
wa particle,Binding particle,*,*,*,*,wa,ha,wa
, Symbol,Comma,*,*,*,*,、,、,、
......
EOS
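When the command-line output is post-processed in a script, each line can be split back into a surface form and its feature list. The sketch below assumes MeCab's default output layout, in which the surface and the comma-separated features are separated by a tab and the sentence ends with an EOS line. The name parse_mecab_output is hypothetical, and the sample rows are written by hand for illustration rather than captured from a real run.

```python
def parse_mecab_output(output):
    """Split MeCab's default output into (surface, features) pairs."""
    tokens = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        surface, _, feature_text = line.partition("\t")
        tokens.append((surface, feature_text.split(",")))
    return tokens


# Hand-written sample in the assumed layout (illustrative only)
sample_output = (
    "TIS\t名詞,固有名詞,組織,*,*,*,TIS株式会社,*,*\n"
    "グループ\t名詞,一般,*,*,*,*,グループ,グループ,グループ\n"
    "EOS\n"
)
for surface, features in parse_mecab_output(sample_output):
    print(surface, features[:3])
```

Splitting on the first tab only keeps the surface intact even when it contains spaces or commas.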
You can also specify a user dictionary directly on the command line with the -u option.
$ echo "TIS Co., Ltd. of the TIS INTEC Group has released JCLdic (Japanese company name dictionary), a dictionary for recognizing company names in natural language processing, free of charge." | mecab -u /usr/local/lib/mecab/dic/user_dict/jcl_medium_mecab.dic
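The same option string can be assembled in a script when the dictionary changes between runs. In this sketch, user_dic_option is a hypothetical helper. It assumes that `-u` accepts a comma-separated list in the same way as the userdic entry in mecabrc, and that the paths contain no spaces.

```python
def user_dic_option(paths):
    """Build a `-u` option string from one or more user dictionary paths."""
    paths = [str(p) for p in paths]
    if not paths:
        return ""  # no user dictionary: add nothing to the command line
    return "-u " + ",".join(paths)


option = user_dic_option(
    ["/usr/local/lib/mecab/dic/user_dict/jcl_medium_mecab.dic"]
)
print(option)
```

The resulting string can be appended to the mecab command, or passed to MeCab.Tagger from mecab-python3.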
Recognize company names from Python with mecab-python3.
parse method
import unicodedata
import MeCab
# 1 specify dictionary by option
# tagger = MeCab.Tagger('-u /usr/local/lib/mecab/dic/user_dict/jcl_slim_mecab.dic')
# 2 import multiple dictionaries by mecabrc
tagger = MeCab.Tagger('-r /usr/local/etc/mecabrc')
text = 'TIS Co., Ltd. of the TIS INTEC Group has released JCLdic (Japanese company name dictionary), a dictionary for recognizing company names in natural language processing, free of charge.'
# NFKC-normalize: full-width (zenkaku) alphanumerics become half-width (hankaku)
text = unicodedata.normalize('NFKC', text)
# parse
print(tagger.parse(text))
Result:
TIS noun,Proper noun,Organization,*,*,*,TIS Co., Ltd.,*,*
Intec noun,Proper noun,Organization,*,*,*,INTEC Inc.,*,*
Group noun,General,*,*,*,*,group,group,group
of particle,Attributive,*,*,*,*,of,no,no
TIS Co., Ltd. noun,Proper noun,Organization,*,*,*,TIS Co., Ltd.,*,*
wa particle,Binding particle,*,*,*,*,wa,ha,wa
, Symbol,Comma,*,*,*,*,、,、,、
...
EOS
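The NFKC normalization step unifies character widths before parsing, presumably so that the input matches the notation used in the dictionary. The short check below uses only the standard library's unicodedata module; the sample strings and the normalize wrapper are illustrative.

```python
import unicodedata


def normalize(text):
    """Apply the same NFKC normalization used before parsing."""
    return unicodedata.normalize("NFKC", text)


# Full-width letters, half-width katakana, and full-width parentheses
samples = ["ＴＩＳ", "ｲﾝﾃｯｸ", "（ＪＣＬｄｉｃ）"]
for s in samples:
    print(s, "->", normalize(s))
```

Note that NFKC works in both directions: full-width alphanumerics and symbols become half-width, while half-width katakana become full-width.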
parseToNode method
Recognize company name entities by checking for the Organization tag (組織 in the actual dictionary output) in each node's features.
import unicodedata
import MeCab
# 1 specify dictionary by option
# tagger = MeCab.Tagger('-u /usr/local/lib/mecab/dic/user_dict/jcl_slim_mecab.dic')
# 2 import multiple dictionaries by mecabrc
tagger = MeCab.Tagger('-r /usr/local/etc/mecabrc')
text = 'TIS Co., Ltd. of the TIS INTEC Group has released JCLdic (Japanese company name dictionary), a dictionary for recognizing company names in natural language processing, free of charge.'
# NFKC-normalize: full-width (zenkaku) alphanumerics become half-width (hankaku)
text = unicodedata.normalize('NFKC', text)
# parse
node = tagger.parseToNode(text)
result = []
while node:
    # node.feature fields: part of speech, POS subclass 1, POS subclass 2, POS subclass 3, conjugated form, conjugation type, base form, reading, pronunciation
    # example: TIS: ['noun', 'Proper noun', 'Organization', '*', '*', '*', 'TIS Co., Ltd.', '*', '*']
    # '組織' is the Japanese tag rendered as "Organization" in the listings
    if node.feature.split(",")[2] == '組織':
        result.append(node.surface)
    node = node.next
print(result)
# ['TIS', 'INTEC', 'TIS Co., Ltd.']
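In the listings above, the seventh feature field of an organization entry holds the full company name (for example, TIS maps to TIS Co., Ltd.), so the same loop can also resolve aliases. The sketch below assumes that layout, and assumes that the organization tag is the Japanese string 組織 in the actual dictionary output. organization_names is a hypothetical helper that takes (surface, feature) pairs, so it can be tried without MeCab installed.

```python
ORG_TAG = "組織"  # assumed tag value; shown as "Organization" in the listings


def organization_names(pairs):
    """Map each organization surface form to the name in feature field 7.

    `pairs` is an iterable of (surface, feature_string) tuples, such as
    (node.surface, node.feature) collected while walking parseToNode.
    """
    names = {}
    for surface, feature in pairs:
        fields = feature.split(",")
        if len(fields) > 6 and fields[2] == ORG_TAG:
            # Fall back to the surface when the base form is missing ('*')
            names[surface] = fields[6] if fields[6] != "*" else surface
    return names


# Hand-written sample pairs for illustration (not captured from a real run)
pairs = [
    ("TIS", "名詞,固有名詞,組織,*,*,*,TIS株式会社,*,*"),
    ("グループ", "名詞,一般,*,*,*,*,グループ,グループ,グループ"),
]
print(organization_names(pairs))
```

Keeping the mapping in a dict means that repeated mentions of the same alias collapse into a single entry.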