Create a company name extractor with Python using JCLdic

TL;DR: Build a company name extractor in Python based on morphological analysis (MeCab) and the Japanese Company Lexicon (JCLdic). The following environment is assumed.

macOS Catalina
Homebrew 2.7.1
Python 3.9

Advance preparation

Download JCLdic

https://github.com/chakki-works/Japanese-Company-Lexicon

Download the JCL_medium MeCab Dic from the README and unzip it. Only the resulting .dic file (jcl_medium_mecab.dic) is needed.

MeCab installation

If you don't have MeCab, install it. Here we install it with Homebrew and use mecab-ipadic as the system dictionary.

brew install mecab
brew install mecab-ipadic

MeCab userdict settings

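Create a directory anywhere you like to hold the dic file used as MeCab's user dictionary. Here I created /usr/local/lib/mecab/dic/user_dict. Then move the unzipped dictionary file into that directory. The commands below use the file name jcl_slim_mecab.dic; if you downloaded a different variant, such as jcl_medium_mecab.dic, substitute its file name here and in mecabrc.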

mkdir /usr/local/lib/mecab/dic/user_dict
mv jcl_slim_mecab.dic /usr/local/lib/mecab/dic/user_dict

Edit mecabrc

After preparing the user dictionary, edit mecabrc, MeCab's configuration file, so that MeCab loads it. The location of mecabrc depends on the installation method; when installed with brew it is /usr/local/etc/mecabrc.

Find the commented-out line ; userdic = <file path>, remove the leading ;, and set the value to the path of the file you placed above.

userdic = /usr/local/lib/mecab/dic/user_dict/jcl_slim_mecab.dic
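
As a quick sanity check, the sketch below scans the text of a mecabrc file for an active (uncommented) userdic entry. The helper name find_userdic and the sample contents are illustrative assumptions, not part of MeCab; the sketch only assumes the "key = value" layout with ; comments described above.

```python
def find_userdic(mecabrc_text):
    """Return the path set by an uncommented `userdic = ...` line, or None."""
    for line in mecabrc_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines (those starting with ';').
        if not stripped or stripped.startswith(';'):
            continue
        key, sep, value = stripped.partition('=')
        if sep and key.strip() == 'userdic':
            return value.strip()
    return None


# Hypothetical mecabrc contents, for illustration only.
sample = """\
dicdir = /usr/local/lib/mecab/dic/ipadic
; userdic = /home/foo/bar/user.dic
userdic = /usr/local/lib/mecab/dic/user_dict/jcl_slim_mecab.dic
"""
print(find_userdic(sample))
```

To check a real file, you could pass the contents of /usr/local/etc/mecabrc to the same function.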

Operation check

First, check in the console that the dictionary is being used.

>>> echo "I work in VisasQ." | mecab
VisasQ noun,Proprietary noun,Organization,*,*,*,VisasQ Co., Ltd.,*,*
Particles,Case particles,General,*,*,*,so,De,De
Working verb,Independence,*,*,Five-dan / Ka line,Continuous connection,work,Hatarai,Hatarai
Particles,Connection particle,*,*,*,*,hand,Te,Te
Verb,Non-independent,*,*,One step,Continuous form,Is,I,I
Auxiliary verb,*,*,*,Special / mass,Uninflected word,Trout,trout,trout
.. symbol,Punctuation,*,*,*,*,。,。,。
EOS

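The dictionary is working if VisasQ is tagged as noun, proper noun, organization and its base-form field holds the official name, VisasQ Co., Ltd. (the actual MeCab output shows these feature values in Japanese, for example 組織 for organization).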
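
The same check can be done programmatically. The following is a minimal sketch that splits one line of MeCab's default output (surface, a tab, then comma-separated features) and tests the first three feature fields. The function name parse_mecab_line is hypothetical, and the sample line is my assumption of what the untranslated output for this example looks like.

```python
def parse_mecab_line(line):
    """Split one line of default MeCab output into (surface, feature list)."""
    surface, _, feature = line.partition('\t')
    return surface, feature.split(',')


# Assumed sample line (Japanese form of the VisasQ example), for illustration.
line = 'ビザスク\t名詞,固有名詞,組織,*,*,*,株式会社ビザスク,*,*'
surface, features = parse_mecab_line(line)

# noun / proper noun / organization in the dictionary's own (Japanese) labels
is_organization = features[:3] == ['名詞', '固有名詞', '組織']
print(surface, is_organization, features[6])
```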

Python

Next, prepare to use MeCab from Python.

Library installation

First, install the MeCab binding for Python.

pip install mecab-python3

Now you are ready to go.
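
If you want to confirm that the binding is importable before running the extractor, a small check like the following may help. This is only a sketch: the function name mecab_available is an assumption of mine, and it merely tests whether a module named MeCab can be found, not whether the dictionaries are configured correctly.

```python
import importlib.util


def mecab_available():
    """Return True if a module named MeCab can be found for import."""
    return importlib.util.find_spec('MeCab') is not None


print('MeCab binding importable:', mecab_available())
```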

Code

Extract company names with the following code.

import unicodedata

import MeCab

# MeCab settings: -r selects the mecabrc that registers the user dictionary
tagger = MeCab.Tagger('-r /usr/local/etc/mecabrc')


def extract_company(text):
    # Normalize the text (NFKC) so that it matches the dictionary entries
    text = unicodedata.normalize('NFKC', text)
    node = tagger.parseToNode(text)
    result = []
    while node:
        # node.feature: part of speech, POS subcategory 1, POS subcategory 2,
        # POS subcategory 3, conjugation type, conjugation form, base form,
        # reading, pronunciation
        features = node.feature.split(',')
        # Feature values in the dictionary are Japanese: '組織' means "organization"
        if len(features) > 6 and features[2] == '組織':
            result.append(
                (node.surface, features[6])
            )
        node = node.next
    return result

Two points are worth noting.

The first is to specify the mecabrc to use with the -r option in the argument to MeCab.Tagger. The second is to normalize the text before parsing it. As a trade-off between dictionary size and lookup speed, JCLdic appears to contain only half-width characters and no full-width ones, so the text to be parsed must be normalized to half-width characters first.
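
To make the effect of the normalization concrete, here is a small sketch using only the standard library. The sample strings and the helper name normalize_text are my own illustrations. Note that NFKC converts full-width alphanumerics to their half-width (ASCII) forms, while half-width katakana go the other way and become ordinary full-width katakana.

```python
import unicodedata


def normalize_text(text):
    """Apply the same NFKC normalization used before parsing."""
    return unicodedata.normalize('NFKC', text)


# Illustrative inputs: full-width Latin letters, half-width katakana, full-width digits
samples = ['ＶｉｓａｓＱ', 'ｿﾆｰ', '１２３']
for s in samples:
    print(repr(s), '->', repr(normalize_text(s)))
```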

In JCLdic, the base-form field holds the official name, such as VisasQ Co., Ltd., so you can obtain a company's official name by reading the base form.

Output

texts = [
    "I work as an engineer in VisasQ.",
    "Mitsubishi UFJ Morgan Stanley Securities M&A Department Associate Lincoln International Vice President Guardian Advisors Partner",
    "Canon Inc. General Manager / Management Supervision Office",
    # Single quotes outside, because the text itself contains double quotes
    'I have been engaged in product marketing of main products. He has spearheaded the planning and launch of "My Sony Club," which unifies membership services. In Synergy Marketing, we have provided support to client companies in the areas of marketing and marketing communications centered on CRM.'
]

for text in texts:
    companies = extract_company(text)
    print("text:", text)
    for company in companies:
        print("keyword: {},Official name: {}".format(company[0], company[1]))
    print()

text: I work as an engineer in VisasQ.
keyword: VisasQ,Official name: VisasQ Co., Ltd.
keyword: engineer,Official name: Engineer Co., Ltd.

text: Mitsubishi UFJ Morgan Stanley Securities M&A Department Associate Lincoln International Vice President Guardian Advisors Partner
keyword: Mitsubishi UFJ Morgan Stanley Securities,Official name: Mitsubishi UFJ Morgan Stanley Securities Co., Ltd.
keyword: M&A,Official name: M&A Co., Ltd.
keyword: Associate,Official name: Associate Co., Ltd.
keyword: Lincoln International,Official name: Lincoln International Co., Ltd.
keyword: Vice,Official name: Vice Co., Ltd.
keyword: Guardian Advisors,Official name: Guardian Advisors Co., Ltd.

text: Canon Inc. General Manager / Management Supervision Office
keyword: Canon Inc,Official name: Canon Inc
keyword: Management Supervision,Official name: Management Supervision Limited Company

text: I have been engaged in product marketing of main products. He has spearheaded the planning and launch of "My Sony Club," which unifies membership services. In Synergy Marketing, we have provided support to client companies in the areas of marketing and marketing communications centered on CRM.
keyword: Sony,Official name: Sony GK
keyword: Synergy Marketing,Official name: Synergy Marketing Co., Ltd.
keyword: client,Official name: Client Co., Ltd.
keyword: CRM,Official name: C.R.M. Co., Ltd.

Because the dictionary contains a large number of Japanese company names, some of which are identical to common nouns, ordinary words can be extracted as company names, which may make it hard to use for some applications. In that case, treat the keywords you do not want to extract as stopwords and skip a node when node.surface is a stopword.
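
One possible way to apply the stopword idea is sketched below. It filters the list of (surface, official name) tuples after extraction rather than inside the node loop, which is a design choice of this sketch, not the only option. The stopword set, the helper name remove_stopwords, and the sample data are all illustrative assumptions.

```python
# Illustrative stopwords: common nouns that also happen to be company names
STOPWORDS = frozenset({'engineer', 'Associate', 'client'})


def remove_stopwords(companies, stopwords=STOPWORDS):
    """Drop (surface, official_name) pairs whose surface is a stopword."""
    return [
        (surface, official_name)
        for surface, official_name in companies
        if surface not in stopwords
    ]


# Sample extractor output, shaped like the return value of extract_company
extracted = [
    ('VisasQ', 'VisasQ Co., Ltd.'),
    ('engineer', 'Engineer Co., Ltd.'),
]
print(remove_stopwords(extracted))
```

The same membership test could instead be placed in extract_company, skipping the append when node.surface is in the stopword set.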
