Sometimes I want to reuse a text analysis function that already exists in another language, so I tried calling one from MATLAB. The example here is neologdn, a Python-based text normalization tool. I'm new to Python, so please forgive any mistakes.
Environment: MATLAB R2020a, Python 3.6
There is an official page called "Calling Python Library Functions", so I set things up by following it. You need both MATLAB and Python installed. There are several Python distributions, and one of them is supported for calling from MATLAB. That seemed the easiest route, so I installed that one as the official page describes.
As a first test, enter the following in MATLAB.
MATLAB
py.os.listdir('.')
The list of files in the current folder was displayed, so os.listdir on the Python side is being called correctly.
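For reference, the MATLAB call above corresponds to the following on the Python side. This is a minimal sketch using only the standard library. Printing `sys.executable` and the version is my own addition, not part of the original steps, and is only meant as a way to confirm which interpreter is in use.

```python
import os
import sys

# Same function that MATLAB reaches through py.os.listdir('.'):
# list the entries of the current working directory.
entries = os.listdir('.')
print(entries)

# Extra check (not in the original steps): show which interpreter is running.
print(sys.executable)
print(sys.version_info[:3])
```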
Next, prepare to use neologdn, a tool that normalizes Japanese.
neologdn is a Japanese text normalizer for mecab-neologd. The normalization is based on the neologd's rules: https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja
Install neologdn.
command prompt
py -m pip install neologdn
You are now ready.
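If you want to confirm that the package is visible before switching back to MATLAB, a check along these lines may help. It is a sketch using the standard library's `importlib`; `is_installed` is a hypothetical helper name, and the script should be run with the same interpreter that pip installed into.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can find a top-level module (hypothetical helper)."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once "py -m pip install neologdn" has succeeded for this interpreter.
print(is_installed("neologdn"))
```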
Let's run the example sentences from the neologdn readme in MATLAB.
MATLAB
>> py.neologdn.normalize("ﾊﾝｶｸｶﾅ")
ans =
Python str with no properties.
ハンカクカナ
>> py.neologdn.normalize("全角記号！？＠＃")
ans =
Python str with no properties.
全角記号!?@#
>> py.neologdn.normalize("全角記号例外「・」")
ans =
Python str with no properties.
全角記号例外「・」
>> py.neologdn.normalize("長音短縮ウェーーーーイ")
ans =
Python str with no properties.
長音短縮ウェーイ
>> py.neologdn.normalize("チルダ削除ウェ~∼∾〜〰～イ")
ans =
Python str with no properties.
チルダ削除ウェイ
>> py.neologdn.normalize("いろんなハイフン˗֊‐‑‒–⁃⁻₋−")
ans =
Python str with no properties.
いろんなハイフン-
>> py.neologdn.normalize("　　　ＰＲＭＬ　　副　読　本　　　")
ans =
Python str with no properties.
PRML副読本
>> py.neologdn.normalize(" Natural Language Processing ")
ans =
Python str with no properties.
Natural Language Processing
>> py.neologdn.normalize("かわいいいいいいいいい", pyargs('repeat',6))
ans =
Python str with no properties.
かわいいいいいい
>> py.neologdn.normalize("無駄無駄無駄無駄ァ", pyargs('repeat',1))
ans =
Python str with no properties.
無駄ァ
>>
The text is processed just as the readme describes. Note that the result comes back as a Python str object rather than a MATLAB string, so convert it with char or string when you need a MATLAB type.
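For comparison, here is roughly what the same calls look like when written directly in Python. MATLAB's `pyargs('repeat',6)` corresponds to the keyword argument `repeat=6`. `normalize_text` is a hypothetical wrapper of my own. The fallback to `unicodedata.normalize("NFKC", ...)` is only a stand-in so that the sketch still runs where neologdn is not installed; it is not how neologdn works, and it covers only the width conversions, not the long-vowel, tilde, or repeat rules.

```python
import unicodedata

try:
    import neologdn  # installed earlier with pip
except ImportError:
    neologdn = None  # assumption: fall back to the standard library below


def normalize_text(text, repeat=None):
    """Normalize text with neologdn when available (hypothetical helper)."""
    if neologdn is not None:
        if repeat is None:
            return neologdn.normalize(text)
        # pyargs('repeat', n) in MATLAB becomes the keyword argument repeat=n.
        return neologdn.normalize(text, repeat=repeat)
    # Stand-in only: NFKC folds half-width/full-width variants but does
    # not apply neologdn's other rules, and it ignores `repeat`.
    return unicodedata.normalize("NFKC", text)


print(normalize_text("ﾊﾝｶｸｶﾅ"))
print(normalize_text("全角記号！？＠＃"))
```

Called from MATLAB, the return value of `neologdn.normalize` is the same Python str that appears as `ans` in the session above.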
It would be convenient to normalize text like this before splitting it into tokens with Text Analytics Toolbox.