TL;DR

Build a company name extractor in Python based on morphological analysis (MeCab), using the Japanese Company Lexicon (JCLdic). The following environment is assumed.

macOS Catalina
Homebrew 2.7.1
python 3.9

Advance preparation

Download JCLdic

https://github.com/chakki-works/Japanese-Company-Lexicon

Download the JCL_medium MeCab Dic from the README and unzip it. This article uses the file jcl_medium_mecab.dic.

MeCab installation

If MeCab is not installed yet, install it. Here it is installed with brew, with mecab-ipadic as the dictionary.

brew install mecab
brew install mecab-ipadic

MeCab userdict settings

Create a directory anywhere you like to hold the dic file for MeCab's userdic setting. Here it is created at /usr/local/lib/mecab/dic/user_dict. Then move the unzipped dictionary jcl_medium_mecab.dic into that directory.

mkdir /usr/local/lib/mecab/dic/user_dict
mv jcl_medium_mecab.dic /usr/local/lib/mecab/dic/user_dict

change mecabrc

After preparing the user dictionary, edit mecabrc, MeCab's configuration file, so that MeCab loads it. The location of mecabrc depends on the installation method; when installed with brew it is /usr/local/etc/mecabrc.

Uncomment the line ; userdic = <file path> by removing the leading ; and set it to the path of the file placed above.

userdic = /usr/local/lib/mecab/dic/user_dict/jcl_medium_mecab.dic
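To confirm that the userdic line really is active and not still commented out, a small check along the following lines may help. This is only a sketch: the function name active_userdic and the sample text are illustrative assumptions, and it relies only on the fact that lines starting with ; are comments in mecabrc.

```python
def active_userdic(mecabrc_text):
    """Return the userdic path set in mecabrc text, or None if none is active."""
    for raw in mecabrc_text.splitlines():
        line = raw.strip()
        # Lines starting with ';' are comments in mecabrc, so skip them
        if not line or line.startswith(';'):
            continue
        key, sep, value = line.partition('=')
        if sep and key.strip() == 'userdic':
            return value.strip()
    return None


# Illustrative sample; in practice, read the text from /usr/local/etc/mecabrc
sample = """\
dicdir = /usr/local/lib/mecab/dic/ipadic
; userdic = /home/foo/bar/user.dic
userdic = /usr/local/lib/mecab/dic/user_dict/jcl_medium_mecab.dic
"""
print(active_userdic(sample))
```

If this prints None for your real mecabrc, the userdic line is probably still commented out.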

Operation check

First, check in the console that the dictionary has been picked up.

$ echo "I work in VisasQ." | mecab
VisasQ noun,Proprietary noun,Organization,*,*,*,VisasQ Co., Ltd.,*,*
Particles,Case particles,General,*,*,*,so,De,De
Working verb,Independence,*,*,Five-dan / Ka line,Continuous connection,work,Hatarai,Hatarai
Particles,Connection particle,*,*,*,*,hand,Te,Te
Verb,Non-independent,*,*,One step,Continuous form,Is,I,I
Auxiliary verb,*,*,*,Special / mass,Uninflected word,Trout,trout,trout
.. symbol,Punctuation,*,*,*,*,。,。,。
EOS

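This is OK because VisasQ is tagged as noun, proper noun, organization, with the official name VisasQ Co., Ltd. in the base form field. (The feature labels above are rendered in English; MeCab itself prints the Japanese tags, so the organization tag appears as 組織.)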
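The same check can be scripted. The sketch below parses mecab's default console output, which is one token per line with the surface form, a tab, and comma-separated features, and reports whether any token carries the organization tag. The function names and the sample string are illustrative assumptions; the sample uses the Japanese tags that MeCab is assumed to print in a real run.

```python
def parse_mecab_output(output):
    """Turn mecab's default console output into (surface, features) pairs."""
    tokens = []
    for line in output.splitlines():
        # Skip the end-of-sentence marker and blank lines
        if not line or line == 'EOS':
            continue
        surface, _, feature = line.partition('\t')
        tokens.append((surface, feature.split(',')))
    return tokens


def has_organization(output, org_tag='組織'):
    """True if any token has the organization tag in POS subcategory 2."""
    return any(
        len(features) > 2 and features[2] == org_tag
        for _, features in parse_mecab_output(output)
    )


# Illustrative sample of what a real run is assumed to print
sample_output = (
    "ビザスク\t名詞,固有名詞,組織,*,*,*,株式会社ビザスク,*,*\n"
    "で\t助詞,格助詞,一般,*,*,*,で,デ,デ\n"
    "EOS\n"
)
print(has_organization(sample_output))
```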

python

Next, prepare to use MeCab from Python.

library install

First, install the Python binding for MeCab.

pip install mecab-python3

Now you are ready to go.

code

Extract company names with the following code.

import unicodedata
import MeCab

# MeCab settings: -r points at the mecabrc that registers the user dictionary
tagger = MeCab.Tagger('-r /usr/local/etc/mecabrc')

def extract_company(text):
    # Normalize the text (full-width alphanumerics become half-width)
    text = unicodedata.normalize('NFKC', text)
    node = tagger.parseToNode(text)
    result = []
    while node:
        # node.feature: part of speech, POS subcategory 1, POS subcategory 2, POS subcategory 3,
        # conjugation type, conjugation form, base form, reading, pronunciation
        features = node.feature.split(',')
        # MeCab emits the Japanese tag 組織 ("organization") in POS subcategory 2
        if len(features) > 6 and features[2] == '組織':
            result.append(
                (node.surface, features[6])
            )
        node = node.next
    return result

There are two points to note.

The first is to specify which mecabrc to use by passing the -r option in the argument of MeCab.Tagger. The second is to normalize the text before parsing it. As a result of a trade-off between dictionary size and search speed, JCLdic seems to use only half-width characters and no full-width ones, so the text to be parsed must be normalized to half-width characters as well.

In JCLdic, the base form field (features[6]) holds the formal name, such as VisasQ Co., Ltd., so extracting the base form gives you the company's official name.
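To see what the normalization step does, the following minimal sketch applies NFKC to a few strings. The helper name normalize and the sample strings are illustrative; unicodedata.normalize is the standard library call used in the code above.

```python
import unicodedata


def normalize(text):
    """Apply NFKC so that full-width alphanumerics and symbols become half-width."""
    return unicodedata.normalize('NFKC', text)


print(normalize('ＳＯＮＹ１２３'))  # SONY123
print(normalize('Ｍ＆Ａ'))          # M&A
# Note that NFKC works the other way for katakana: half-width becomes full-width
print(normalize('ｿﾆｰ'))            # ソニー
```

Without this step, a company name written in full-width letters would presumably not match the half-width entry in the dictionary.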

output

texts = [
    "I work as an engineer in VisasQ.",
    "Mitsubishi UFJ Morgan Stanley Securities M&A Department Associate Lincoln International Vice President Guardian Advisors Partner",
    "Canon Inc. General Manager / Management Supervision Office",
    "I have been engaged in product marketing of main products. He has spearheaded the planning and launch of \"My Sony Club,\" which unifies membership services. In Synergy Marketing, we have provided support to client companies in the areas of marketing and marketing communications centered on CRM."
]

for text in texts:
    companies = extract_company(text)
    print("text: ", text)
    for company in companies:
        print("keyword: {},Official name: {}".format(company[0], company[1]))

text:I work as an engineer in VisasQ.
keyword:VisasQ,Official name:VisasQ Co., Ltd.
keyword:Engineers,Official name:Engineer Co., Ltd.

text:Mitsubishi UFJ Morgan Stanley Securities M&A Department Associate Lincoln International Vice President Guardian Advisors Partner
keyword:Mitsubishi UFJ Morgan Stanley Securities,Official name:Mitsubishi UFJ Morgan Stanley Securities Co., Ltd.
keyword:M&A,Official name:M&A Co., Ltd.
keyword:Associate,Official name:Associate Co., Ltd.
keyword:Lincoln International,Official name:Lincoln International Co., Ltd.
keyword:Weiss,Official name:Weiss Co., Ltd.
keyword:Guardian Advisors,Official name:Guardian Advisors Co., Ltd.

text:Canon Inc. General Manager / Management Supervision Office
keyword:Canon Inc,Official name:Canon Inc
keyword:Management supervision,Official name:Limited company management supervision

text:I have been engaged in product marketing of main products. He has spearheaded the planning and launch of "My Sony Club," which unifies membership services. In Synergy Marketing, we have provided support to client companies in the areas of marketing and marketing communications centered on CRM.
keyword: Sony,Official name:Sony GK
keyword:Synergy marketing,Official name:Synergy Marketing Co., Ltd.
keyword:client,Official name:Client Co., Ltd.
keyword:CRM,Official name:C.R.M. Co., Ltd.

Because the dictionary contains a very large number of Japanese company names, companies whose names coincide with common nouns (such as the engineer, associate, and client matches above) are extracted as well, which can make it hard to use depending on the application. In that case, treat the keywords you do not want to extract as stopwords and add a step that skips a node when node.surface is a stopword.
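A minimal sketch of the stopword idea follows. It filters the list of (surface, official name) pairs after extraction, which has the same effect as skipping the node inside the loop. The names STOPWORDS and drop_stopwords and the stopword entries are hypothetical and should be adapted to your data.

```python
# Hypothetical stopword set: surfaces that are usually common nouns, not companies
STOPWORDS = {'Engineers', 'Associate', 'client'}


def drop_stopwords(companies, stopwords=STOPWORDS):
    """Keep only the (surface, official_name) pairs whose surface is not a stopword."""
    return [
        (surface, official)
        for surface, official in companies
        if surface not in stopwords
    ]


# Pairs in the shape returned by extract_company (values taken from the output above)
extracted = [
    ('VisasQ', 'VisasQ Co., Ltd.'),
    ('Engineers', 'Engineer Co., Ltd.'),
]
print(drop_stopwords(extracted))
```

Matching on the surface form rather than the official name keeps the list short, but it also drops a genuine mention of a company whose name is a stopword, so the set is best kept small.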

Create a company name extractor with python using JCLdic