TL;DR
Create a company name extractor in Python based on morphological analysis (MeCab), using the Japanese Company Lexicon (JCLdic). The following environment is assumed.
macOS Catalina
Homebrew 2.7.1
python 3.9
https://github.com/chakki-works/Japanese-Company-Lexicon
Download the JCL_medium MeCab Dic linked from the README and unzip it.
The file needed here is jcl_medium_mecab.dic.
If you don't have MeCab, install it. Here it is installed with brew, with mecab-ipadic as the system dictionary.
brew install mecab
brew install mecab-ipadic
Create a directory anywhere you like to hold the dic file for MeCab's user dictionary (userdic) setting.
Here it is /usr/local/lib/mecab/dic/user_dict.
Then move the unzipped MeCab dictionary file into that directory. (The commands below use jcl_slim_mecab.dic; substitute the name of the file you actually downloaded.)
mkdir /usr/local/lib/mecab/dic/user_dict
mv jcl_slim_mecab.dic /usr/local/lib/mecab/dic/user_dict
With the user dictionary in place, edit mecabrc, MeCab's configuration file, so that MeCab loads it.
The location of mecabrc depends on how MeCab was installed; with brew it is /usr/local/etc/mecabrc.
Find the commented-out line ; userdic = <file path>, remove the leading ; and set it to the path of the file placed above.
userdic = /usr/local/lib/mecab/dic/user_dict/jcl_slim_mecab.dic
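As a quick sanity check on this setting, here is a minimal sketch that scans mecabrc-style text for an active (uncommented) userdic line. The helper name find_userdic and the sample contents are hypothetical, and the sketch assumes the simple "key = value" layout with ; comments described above.

```python
def find_userdic(mecabrc_text):
    """Return the path set by an uncommented `userdic = ...` line, or None."""
    for raw_line in mecabrc_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and lines commented out with a leading ';'.
        if not line or line.startswith(";"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "userdic":
            return value.strip()
    return None


# Hypothetical file contents, for illustration only.
sample = """\
dicdir = /usr/local/lib/mecab/dic/ipadic
; userdic = /home/foo/bar/user.dic
userdic = /usr/local/lib/mecab/dic/user_dict/jcl_slim_mecab.dic
"""
print(find_userdic(sample))
```

To check the real file, the same function can be given the text read from /usr/local/etc/mecabrc. A result of None would suggest the line is still commented out.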
First, check from the console that the dictionary has taken effect.
$ echo "ビザスクで働いています。" | mecab
ビザスク	名詞,固有名詞,組織,*,*,*,株式会社ビザスク,*,*
で	助詞,格助詞,一般,*,*,*,で,デ,デ
働い	動詞,自立,*,*,五段・カ行イ音便,連用タ接続,働く,ハタライ,ハタライ
て	助詞,接続助詞,*,*,*,*,て,テ,テ
い	動詞,非自立,*,*,一段,連用形,いる,イ,イ
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。	記号,句点,*,*,*,*,。,。,。
EOS
This is OK because ビザスク (VisasQ) comes out as 名詞,固有名詞,組織,*,*,*,株式会社ビザスク,*,* (noun, proper noun, organization, with 株式会社ビザスク, i.e. VisasQ Co., Ltd., as the base form).
Python
Next, prepare to use MeCab from Python.
Library install
First, install the Python binding.
pip install mecab-python3
Now you are ready to go.
Code
Extract company names with the following code.
import unicodedata

import MeCab

# MeCab settings: -r points at the mecabrc in which userdic is configured
tagger = MeCab.Tagger('-r /usr/local/etc/mecabrc')


def extract_company(text):
    # Normalize the text (NFKC) before parsing
    text = unicodedata.normalize('NFKC', text)
    node = tagger.parseToNode(text)
    result = []
    while node:
        # node.feature: part of speech, POS subcategory 1, POS subcategory 2, POS subcategory 3,
        # conjugation type, conjugation form, base form, reading, pronunciation
        features = node.feature.split(',')
        # Organization entries carry the Japanese tag 組織 in POS subcategory 2
        if len(features) > 6 and features[2] == '組織':
            result.append(
                (node.surface, features[6])
            )
        node = node.next
    return result
There are two points to note.
The first is to tell MeCab.Tagger which mecabrc to use via the -r option. The second is to normalize the text before parsing it. As a trade-off between dictionary size and lookup speed, JCLdic seems to contain only half-width forms and no full-width variants, so the text must be normalized to half-width before it is parsed.
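To make the normalization step concrete, here is a small sketch that uses only the standard library. The helper name normalize_text and the sample strings are made up for illustration. Note that NFKC folds full-width Latin letters, digits and symbols to their half-width forms, but converts half-width katakana to full-width katakana.

```python
import unicodedata


def normalize_text(text):
    """Apply the same NFKC normalization used before parsing."""
    return unicodedata.normalize("NFKC", text)


samples = [
    "Ｍ＆Ａ",        # full-width Latin letters and ampersand
    "ＣＲＭ１２３",  # full-width letters and digits
    "ｿﾆｰ",          # half-width katakana
]
for s in samples:
    print(repr(s), "->", repr(normalize_text(s)))
```

Normalizing on the input side means the dictionary only needs one spelling per name, which fits the size versus speed trade-off mentioned above.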
In JCLdic the base form field holds the formal name, such as VisasQ Co., Ltd. (株式会社ビザスク), so reading the base form gives you the company's official name.
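The field lookup can be tried without MeCab installed. The sketch below works on feature strings in the nine-field layout listed in the code comment above. The helper name official_name is hypothetical, and the two feature strings are assumed examples of what the dictionaries emit, with the Japanese tag 組織 marking an organization.

```python
def official_name(feature):
    """Return the base form (7th field) if the entry is tagged as an organization."""
    fields = feature.split(",")
    # Field index 2 is POS subcategory 2; index 6 is the base form.
    if len(fields) > 6 and fields[2] == "組織":
        return fields[6]
    return None


# Assumed example feature strings (a company entry and an ordinary particle).
company_feature = "名詞,固有名詞,組織,*,*,*,株式会社ビザスク,*,*"
particle_feature = "助詞,格助詞,一般,*,*,*,で,デ,デ"

print(official_name(company_feature))
print(official_name(particle_feature))
```

The length check is a defensive addition for entries that carry fewer than seven fields.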
# Sample inputs (shown in English translation)
texts = [
    "I work as an engineer in VisasQ.",
    "Mitsubishi UFJ Morgan Stanley Securities M&A Department Associate Lincoln International Vice President Guardian Advisors Partner",
    "Canon Inc. General Manager / Management Supervision Office",
    'I have been engaged in product marketing of main products. He has spearheaded the planning and launch of "My Sony Club," which unifies membership services. In Synergy Marketing, we have provided support to client companies in the areas of marketing and marketing communications centered on CRM.',
]
for text in texts:
    companies = extract_company(text)
    print("text:", text)
    for company in companies:
        print("keyword: {},Official name: {}".format(company[0], company[1]))
Output:
text: I work as an engineer in VisasQ.
keyword: VisasQ,Official name: VisasQ Co., Ltd.
keyword: Engineer,Official name: Engineer Co., Ltd.
text: Mitsubishi UFJ Morgan Stanley Securities M&A Department Associate Lincoln International Vice President Guardian Advisors Partner
keyword: Mitsubishi UFJ Morgan Stanley Securities,Official name: Mitsubishi UFJ Morgan Stanley Securities Co., Ltd.
keyword: M&A,Official name: M&A Co., Ltd.
keyword: Associate,Official name: Associate Co., Ltd.
keyword: Lincoln International,Official name: Lincoln International Co., Ltd.
keyword: Vice,Official name: Vice Co., Ltd.
keyword: Guardian Advisors,Official name: Guardian Advisors Co., Ltd.
text: Canon Inc. General Manager / Management Supervision Office
keyword: Canon Inc.,Official name: Canon Inc.
keyword: Management Supervision,Official name: Management Supervision Ltd.
text: I have been engaged in product marketing of main products. He has spearheaded the planning and launch of "My Sony Club," which unifies membership services. In Synergy Marketing, we have provided support to client companies in the areas of marketing and marketing communications centered on CRM.
keyword: Sony,Official name: Sony GK
keyword: Synergy Marketing,Official name: Synergy Marketing Co., Ltd.
keyword: client,Official name: Client Co., Ltd.
keyword: CRM,Official name: C.R.M. Co., Ltd.
Because the dictionary contains a very large number of Japanese company names, companies whose names are ordinary nouns (such as "engineer" or "client" above) are extracted as well, which can make it hard to use for some applications.
In that case, treat the keywords you do not want to extract as stopwords and add a step that skips a node when node.surface is a stopword.
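One possible shape for that stopword step is sketched below. So that it runs without MeCab, it filters the (surface, official name) tuples returned by extract_company rather than checking node.surface inside the loop; the effect is the same. The names STOPWORDS and drop_stopwords and the listed words are hypothetical, and real stopwords would be the Japanese surface strings.

```python
# Hypothetical stopword list; in practice these would be Japanese surface forms.
STOPWORDS = {"Engineer", "Associate", "client"}


def drop_stopwords(companies, stopwords=STOPWORDS):
    """Keep only (surface, official_name) pairs whose surface is not a stopword."""
    return [
        (surface, official)
        for surface, official in companies
        if surface not in stopwords
    ]


extracted = [
    ("VisasQ", "VisasQ Co., Ltd."),
    ("Engineer", "Engineer Co., Ltd."),
]
print(drop_stopwords(extracted))
```

Keeping the stopwords in a set makes each membership test cheap, and the list can be extended as new false positives turn up.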