I found a wonderful morphological analyzer called SudachiPy, so I tried calling it from my usual PowerShell.
If you pipe a string to it, it returns an object whose `line` property holds the string you entered and whose `parsed` property holds the parsing result.
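For example, once everything below is in place, a call like the following works. This is just a preview sketch; the `Invoke-SudachiTokenizer` function is defined later in this article, and the sample sentence is arbitrary.

```powershell
# Preview of the finished tool (Invoke-SudachiTokenizer is defined below).
$result = "吾輩は猫である。" | Invoke-SudachiTokenizer
$result.line    # the original input string
$result.parsed  # tokens with surface, pos, yomi, c_type, c_form
```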
The overall structure is that the main analysis is written in Python and called from PowerShell.
It would also be possible to pass strings in and out via command-line arguments and standard output with `print`, but I use temporary files instead because of the following problem: an error occurs when trying to `print` a character that cannot be represented by CP932.
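As a rough sketch of that failure mode (assuming a Japanese-locale Windows whose ANSI code page is 932 and no `PYTHONUTF8`/`PYTHONIOENCODING` override): when PowerShell captures the output of a Python process, stdout is no longer a console, so Python falls back to CP932 and `print` fails for characters outside it.

```powershell
# Hypothetical repro of the print/CP932 problem: capturing stdout means the
# Python process writes through CP932 on a Japanese-locale Windows, so a
# character outside CP932 (here an emoji) triggers UnicodeEncodeError.
$captured = python -c "print('\U0001F600')"
```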
If you install Python via Scoop, the PATH is handled for you, which makes things easier. As preliminary preparation, install SudachiPy and fire with pip:
```
pip install sudachipy
pip install fire
```
The process of "morphologically analyzing the contents of a text file line by line and writing the result to another text file" is put into a single function and turned into a CLI tool with `fire.Fire()`.
sudachi_tokenizer.py

```python
import fire
import re
from sudachipy import tokenizer
from sudachipy import dictionary


def main(input_file_path, output_file_path, ignore_paren=False):
    tokenizer_obj = dictionary.Dictionary().create()
    mode = tokenizer.Tokenizer.SplitMode.C
    with open(input_file_path, "r", encoding="utf_8_sig") as input_file:
        all_lines = input_file.read()
    lines = all_lines.splitlines()
    json_style_list = []
    for line in lines:
        if not line:
            json_style_list.append({"line": "", "parsed": []})
        else:
            if ignore_paren:
                # strip text inside half-width and full-width parentheses / brackets
                target = re.sub(r"\(.+?\)|\[.+?\]|（.+?）|［.+?］", "", line)
            else:
                target = line
            tokens = tokenizer_obj.tokenize(target, mode)
            parsed = []
            for t in tokens:
                surface = t.surface()
                pos = t.part_of_speech()[0]
                c_type = t.part_of_speech()[4]
                c_form = t.part_of_speech()[5]
                yomi = t.reading_form()
                parsed.append({"surface": surface, "pos": pos, "yomi": yomi, "c_type": c_type, "c_form": c_form})
            json_style_list.append({"line": line, "parsed": parsed})
    with open(output_file_path, mode="w", encoding="utf_8_sig") as output_file:
        output_file.write(str(json_style_list))


if __name__ == "__main__":
    fire.Fire(main)
```
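Incidentally, the script can be tried on its own before wiring it into PowerShell, since fire exposes `main`'s parameters as command-line arguments. The file names below are placeholders.

```powershell
# Direct invocation of the fire-generated CLI (input.txt / output.txt are example paths).
python -B .\sudachi_tokenizer.py .\input.txt .\output.txt
# Same, with the option that strips parenthesized text:
python -B .\sudachi_tokenizer.py .\input.txt .\output.txt --ignore_paren=True
```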
For business use, I often wanted to skip the text inside round parentheses `()` / `（）` and square brackets `[]` / `［］`, so I added an option for that.
The input and output files are written as UTF-8 with BOM because of the PowerShell behavior described later.
You can use the cmdlet from the console by creating the following .ps1 file in the same directory as the above sudachi_tokenizer.py and loading it from `$PROFILE` (an example dot-source line is shown after the function).
```powershell
function Invoke-SudachiTokenizer {
    param (
        [switch]$ignoreParen
    )
    $outputTmp = New-TemporaryFile
    $inputTmp = New-TemporaryFile
    $input | Out-File -Encoding utf8 -FilePath $inputTmp.FullName # with BOM
    $sudachiPath = "{0}\sudachi_tokenizer.py" -f $PSScriptRoot
    $command = 'python -B "{0}" "{1}" "{2}"' -f $sudachiPath, $inputTmp.FullName, $outputTmp.FullName
    if ($ignoreParen) {
        $command += ' --ignore_paren=True'
    }
    Invoke-Expression -Command $command
    $parsed = Get-Content -Path $outputTmp.FullName -Encoding UTF8
    @($inputTmp, $outputTmp) | Remove-Item # manually clean up the temporary files
    return ($parsed | ConvertFrom-Json)
}
```
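As one possible way to load it, add a dot-source line like the following to the file `$PROFILE` points to; the path here is only an example of wherever you saved the .ps1.

```powershell
# Example line for $PROFILE; adjust the path to where the .ps1 actually lives.
. "C:\tools\sudachi\sudachi_tokenizer.ps1"
```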
Because a Python list of dicts is written out in the same shape as a JSON array, I convert it back into objects with `ConvertFrom-Json` on the PowerShell side.
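After the conversion, each element is an ordinary object, so the token fields can be handled as properties. A small usage sketch follows; the sample sentence and file path are arbitrary, and the property names come from the Python script above.

```powershell
# Tokenize a sentence and show the token fields produced by the Python side.
$result = "今日は良い天気です。" | Invoke-SudachiTokenizer
$result.parsed | Format-Table surface, pos, yomi, c_type, c_form

# A whole text file can be fed in the same way, one object per input line.
$fromFile = Get-Content .\sample.txt | Invoke-SudachiTokenizer
```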
As noted in the comment above, keep in mind that if you specify UTF8 for the `-Encoding` parameter in PowerShell, the output automatically gets a BOM.
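A quick way to see that BOM, as a sketch assuming Windows PowerShell 5.1 (the temporary file name is arbitrary):

```powershell
# -Encoding utf8 in Windows PowerShell writes a UTF-8 BOM (EF BB BF) at the start of the file.
"test" | Out-File -Encoding utf8 -FilePath .\bom_check.txt
[System.IO.File]::ReadAllBytes((Resolve-Path .\bom_check.txt).Path) |
    Select-Object -First 3 |
    ForEach-Object { "{0:X2}" -f $_ }   # EF BB BF
```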