I found a wonderful morphological analyzer called SudachiPy, so I tried calling it from my usual PowerShell.
If you pipe a string to it, it returns an object whose `line` property holds the string you entered and whose `parsed` property holds the parsing result.
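For example, once everything below is in place, a call like the following works. This is just a preview sketch; the `Invoke-SudachiTokenizer` function is defined later in this article, and the sample sentence is arbitrary.

```powershell
# Preview of the finished tool (Invoke-SudachiTokenizer is defined below).
$result = "吾輩は猫である。" | Invoke-SudachiTokenizer
$result.line    # the original input string
$result.parsed  # tokens with surface, pos, yomi, c_type, c_form
```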
The overall structure is that the main analysis is written in Python and called from PowerShell.
It would also be possible to pass strings in and out via command-line arguments and standard output with `print`, but I use temporary files instead because of the following problem: an error occurs when trying to `print` a character that cannot be represented by CP932.
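As a rough sketch of that failure mode (assuming a Japanese-locale Windows whose ANSI code page is 932 and no `PYTHONUTF8`/`PYTHONIOENCODING` override): when PowerShell captures the output of a Python process, stdout is no longer a console, so Python falls back to CP932 and `print` fails for characters outside it.

```powershell
# Hypothetical repro of the print/CP932 problem: capturing stdout means the
# Python process writes through CP932 on a Japanese-locale Windows, so a
# character outside CP932 (here an emoji) triggers UnicodeEncodeError.
$captured = python -c "print('\U0001F600')"
```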
If you install Python via Scoop, the PATH is handled for you, which makes things easier. As preliminary preparation, install SudachiPy and fire with pip:
```
pip install sudachipy
pip install fire
```
The process of "morphologically analyzing the contents of a text file line by line and writing the result to another text file" is put into a single function and turned into a CLI tool with `fire.Fire()`.
sudachi_tokenizer.py

```python
import fire
import re
from sudachipy import tokenizer
from sudachipy import dictionary


def main(input_file_path, output_file_path, ignore_paren=False):
    tokenizer_obj = dictionary.Dictionary().create()
    mode = tokenizer.Tokenizer.SplitMode.C
    with open(input_file_path, "r", encoding="utf_8_sig") as input_file:
        all_lines = input_file.read()
    lines = all_lines.splitlines()
    json_style_list = []
    for line in lines:
        if not line:
            json_style_list.append({"line": "", "parsed": []})
        else:
            if ignore_paren:
                # strip text inside half-width and full-width parentheses / brackets
                target = re.sub(r"\(.+?\)|\[.+?\]|（.+?）|［.+?］", "", line)
            else:
                target = line
            tokens = tokenizer_obj.tokenize(target, mode)
            parsed = []
            for t in tokens:
                surface = t.surface()
                pos = t.part_of_speech()[0]
                c_type = t.part_of_speech()[4]
                c_form = t.part_of_speech()[5]
                yomi = t.reading_form()
                parsed.append({"surface": surface, "pos": pos, "yomi": yomi, "c_type": c_type, "c_form": c_form})
            json_style_list.append({"line": line, "parsed": parsed})
    with open(output_file_path, mode="w", encoding="utf_8_sig") as output_file:
        output_file.write(str(json_style_list))


if __name__ == "__main__":
    fire.Fire(main)
```
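Incidentally, the script can be tried on its own before wiring it into PowerShell, since fire exposes `main`'s parameters as command-line arguments. The file names below are placeholders.

```powershell
# Direct invocation of the fire-generated CLI (input.txt / output.txt are example paths).
python -B .\sudachi_tokenizer.py .\input.txt .\output.txt
# Same, with the option that strips parenthesized text:
python -B .\sudachi_tokenizer.py .\input.txt .\output.txt --ignore_paren=True
```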
For business use, I often wanted to skip the text inside round parentheses `()` / `（）` and square brackets `[]` / `［］`, so I added an option for that.
The input and output files are written as UTF-8 with BOM because of the PowerShell behavior described later.
You can use the cmdlet from the console by creating the following .ps1 file in the same directory as the above sudachi_tokenizer.py and loading it from `$PROFILE` (an example dot-source line is shown after the function).
```powershell
function Invoke-SudachiTokenizer {
    param (
        [switch]$ignoreParen
    )
    $outputTmp = New-TemporaryFile
    $inputTmp = New-TemporaryFile
    $input | Out-File -Encoding utf8 -FilePath $inputTmp.FullName # with BOM
    $sudachiPath = "{0}\sudachi_tokenizer.py" -f $PSScriptRoot
    $command = 'python -B "{0}" "{1}" "{2}"' -f $sudachiPath, $inputTmp.FullName, $outputTmp.FullName
    if ($ignoreParen) {
        $command += ' --ignore_paren=True'
    }
    Invoke-Expression -Command $command
    $parsed = Get-Content -Path $outputTmp.FullName -Encoding UTF8
    @($inputTmp, $outputTmp) | Remove-Item # manually clean up the temporary files
    return ($parsed | ConvertFrom-Json)
}
```
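As one possible way to load it, add a dot-source line like the following to the file `$PROFILE` points to; the path here is only an example of wherever you saved the .ps1.

```powershell
# Example line for $PROFILE; adjust the path to where the .ps1 actually lives.
. "C:\tools\sudachi\sudachi_tokenizer.ps1"
```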
Because a Python list of dicts is written out in the same shape as a JSON array, I convert it back into objects with `ConvertFrom-Json` on the PowerShell side.
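After the conversion, each element is an ordinary object, so the token fields can be handled as properties. A small usage sketch follows; the sample sentence and file path are arbitrary, and the property names come from the Python script above.

```powershell
# Tokenize a sentence and show the token fields produced by the Python side.
$result = "今日は良い天気です。" | Invoke-SudachiTokenizer
$result.parsed | Format-Table surface, pos, yomi, c_type, c_form

# A whole text file can be fed in the same way, one object per input line.
$fromFile = Get-Content .\sample.txt | Invoke-SudachiTokenizer
```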
As noted in the comment above, keep in mind that if you specify UTF8 for the `-Encoding` parameter in PowerShell, the output automatically gets a BOM.
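A quick way to see that BOM, as a sketch assuming Windows PowerShell 5.1 (the temporary file name is arbitrary):

```powershell
# -Encoding utf8 in Windows PowerShell writes a UTF-8 BOM (EF BB BF) at the start of the file.
"test" | Out-File -Encoding utf8 -FilePath .\bom_check.txt
[System.IO.File]::ReadAllBytes((Resolve-Path .\bom_check.txt).Path) |
    Select-Object -First 3 |
    ForEach-Object { "{0:X2}" -f $_ }   # EF BB BF
```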