> "Natto (not genetically modified)" | Get-ReadingWithSudachi | fl

Line     : Natto (not genetically modified)
Reading  : Natto (Idenshikumikaedenai)
Tokenize : Natto(Natto)/(/gene(Idenshi)/Recombinant(Kumikae)/de/nai/)
Markup   : <p><ruby>Natto<rt>Natto</rt></ruby>(<ruby>gene<rt>Idenshi</rt></ruby><ruby>Recombinant<rt>Kumikae</rt></ruby>denai)</p>
Environment:
> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      7.0.3
PSEdition                      Core
GitCommitId                    7.0.3
OS                             Microsoft Windows 10.0.18362
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0
The function below calls the morphological analysis wrapper written in an earlier post ([PowerShell] morphological analysis with SudachiPy).
function Get-ReadingWithSudachi {
    param (
        [switch]$readingOnly,
        [switch]$ignoreParen
    )
    $ret = New-Object System.Collections.ArrayList
    # Tokenize every piped-in line with the SudachiPy wrapper.
    $tokenizedResults = $input | Invoke-SudachiTokenizer -ignoreParen:$ignoreParen
    foreach ($result in $tokenizedResults) {
        $reading = New-Object System.Text.StringBuilder
        $tokenize = New-Object System.Collections.ArrayList
        $markup = New-Object System.Collections.ArrayList
        foreach ($token in $result.parsed) {
            $tokenSurface = $token.surface
            # Symbols, whitespace, katakana, and half-/full-width alphanumerics
            # need no reading: use the surface form everywhere.
            if ($token.pos -match "記号|空白" -or $tokenSurface -match "^([ァ-ヴー]|[a-zA-Zａ-ｚＡ-Ｚ]|[0-9０-９]|[\W\s])+$") {
                $tokenReading = $tokenSurface
                $tokenInfo = $tokenSurface
                $tokenMarkup = $tokenSurface
            }
            # No reading available: mark the token with "(?)" in Tokenize.
            elseif (-not $token.reading) {
                $tokenReading = $tokenSurface
                $tokenInfo = "$($tokenSurface)(?)"
                $tokenMarkup = $tokenSurface
            }
            else {
                $tokenReading = $token.reading
                # Hiragana-only tokens are shown without an annotation.
                $isHiragana = $tokenSurface -match "^[ぁ-んー]+$"
                $tokenInfo = ($isHiragana) ?
                    $tokenSurface :
                    "$($tokenSurface)($tokenReading)"
                $tokenMarkup = ($isHiragana) ?
                    $tokenSurface :
                    ("<ruby>{0}<rt>{1}</rt></ruby>" -f $tokenSurface, $tokenReading)
            }
            $reading.Append($tokenReading) > $null
            $tokenize.Add($tokenInfo) > $null
            $markup.Add($tokenMarkup) > $null
        }
        $ret.Add([PSCustomObject]@{
                Line     = $result.line
                Reading  = $reading.ToString()
                Tokenize = $tokenize -join "/"
                Markup   = "<p>{0}</p>" -f ($markup -join "")
            }) > $null
    }
    return ($readingOnly) ? $ret.Reading : $ret
}
Technical terms and similar words are sometimes analyzed incorrectly.
With only one or two lines you can check the result by eye, but that is impractical for hundreds of lines, so I added a property called Markup that outputs HTML markup.
(cat hogehoge.txt | Get-ReadingWithSudachi).Markup | Out-File hogehoge.html
Converting the output to HTML this way and checking it in a browser should reduce oversights to some extent.
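For a large file it may also help to list the lines that contain a token with no reading, and to wrap the markup in a complete HTML document so the browser does not have to guess the encoding. The following is a minimal sketch, not part of the original function. The sample objects only imitate the shape of the Get-ReadingWithSudachi output, and the variable names ($results, $unread, $html) are hypothetical.

```powershell
# Hypothetical sample objects shaped like the output of Get-ReadingWithSudachi.
# In real use, replace this with: $results = cat hogehoge.txt | Get-ReadingWithSudachi
$results = @(
    [PSCustomObject]@{
        Line     = 'sample line 1'
        Tokenize = 'Natto(Natto)/gene(Idenshi)'
        Markup   = '<p><ruby>Natto<rt>Natto</rt></ruby><ruby>gene<rt>Idenshi</rt></ruby></p>'
    },
    [PSCustomObject]@{
        Line     = 'sample line 2'
        Tokenize = 'XYZ(?)'
        Markup   = '<p>XYZ</p>'
    }
)

# Tokens without a reading are recorded as "surface(?)" in Tokenize,
# so a literal substring search finds the lines that need manual review.
# (String.Contains is used because "?" is a wildcard for -like.)
$unread = @($results | Where-Object { $_.Tokenize.Contains('(?)') })

# Wrap the <p> elements in a full document with an explicit charset.
$html = @(
    '<!DOCTYPE html>'
    '<html><head><meta charset="utf-8"><title>Reading check</title></head><body>'
    $results.Markup
    '</body></html>'
) -join [Environment]::NewLine
```

Under these assumptions, `$unread.Line` gives the lines to review first, and `$html | Out-File hogehoge.html` writes a page that can be opened directly in a browser.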