In recent years, natural language processing has made remarkable progress, and it is being applied in more and more fields. I often work with NLP and AI, and the most troublesome (but important) part of that work is the various kinds of preprocessing.
Most tasks involve a number of standard preprocessing steps.
I mainly use Python, but there was no suitable library for **Japanese sentence segmentation**, so I ended up writing similar code every time. I figured there must be about 100 people in the world with the same problem, so I decided to write my own library and publish it as OSS... back at the beginning of 2019. I couldn't secure the time and motivation, and it kept slipping, but I finally got started by giving myself the deadline of writing this Advent Calendar article.
I think the following simple rule is what people most commonly use to split sentences.
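A minimal sketch of that rule (my own illustration, not any particular library's API) splits after the sentence-ending punctuation 。!? and keeps each delimiter attached to its sentence:

```python
import re

def split_simple(text: str):
    """Naively split after the sentence-ending punctuation 。!? ."""
    # The capturing group makes re.split keep the delimiters in the result
    parts = re.split(r"([。!?])", text)
    # Re-attach each delimiter to the text that precedes it
    return ["".join(pair) for pair in zip(parts[0::2], parts[1::2]) if pair[0]]

print(split_simple("今日は晴れです。明日は雨でしょうか?それでも行きます!"))
# => ['今日は晴れです。', '明日は雨でしょうか?', 'それでも行きます!']
```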
However, many real documents cannot be segmented well by such simple rules.
For example, if you naively split a text like `"Yes, that's right." I answered.` at punctuation marks, it gets divided as follows:

```
"Yes, that's right.
" I answered.
```

There are situations where that may be fine, but in many cases you will want to treat `"Yes, that's right." I answered.` as a single sentence.
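A common first workaround (again a rough sketch of my own, not library code) is to make sure the split never lands between the punctuation and its closing quote. Note that it still does not merge the quote with the surrounding clause; that harder part is why I wanted a proper library:

```python
import re

def split_quote_aware(text: str):
    # Each sentence runs up to 。.!? and also swallows one closing quote,
    # so a period directly before 」 or " does not cause a split
    # (text after the last punctuation mark is silently dropped here)
    return re.findall(r'[^。.!?]*[。.!?]+[」』"]?', text)

print(split_quote_aware('"Yes, that\'s right." I answered.'))
# => ['"Yes, that\'s right."', ' I answered.']
```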
Also, line breaks often appear in the middle of a sentence, for example so that the text fits on one screen (this is especially common in documents inside a company):
```
In natural language processing, ~ omitted ~
is commonly used.
```

If you split this at the line break, it becomes two sentences, but you will usually want to treat it as the single sentence "In natural language processing, ~ omitted ~ is commonly used."
In the example above you can manage by deleting the line breaks and then splitting on punctuation, but some texts **contain sentences with no punctuation at all**, which makes things much more troublesome. (~~Please just add the punctuation...~~)
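For the pure line-break case, the fix just mentioned is simple; here is a rough sketch (join first, then split; the missing-punctuation case is exactly what this cannot save):

```python
import re

text = "In natural language processing, ~ omitted ~\nis commonly used."

# Join the broken lines first (for Japanese, join with "" instead of " "),
# then apply the naive punctuation split from above
joined = " ".join(text.splitlines())
print([s for s in re.split(r"(?<=[。.!?])", joined) if s.strip()])
# => ['In natural language processing, ~ omitted ~ is commonly used.']
```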
Here is a typical example, an email-style quotation:

```
>>I was planning to go to the barber shop tomorrow, but he said, "I will change
>>my schedule in a hurry. Please let me change
>>the schedule of the meeting."
I've acknowledged.
```
This is the case where line breaks and unnecessary symbols appear at the beginning of lines in the middle of a sentence. There is a theory that it is the most common case in corporate documents (subjectively speaking). The easiest approach is to remove the symbols and line breaks first and then process the result, but more often than not you want to join the lines, removing the unnecessary symbols while still keeping the information that they formed a quoted block.
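As a rough sketch of that easiest approach (my own illustration; `strip_quote_markers` is a hypothetical helper, not part of any library):

```python
import re

def strip_quote_markers(text: str) -> str:
    """Remove leading '>' markers and re-join the quoted lines.

    Note: this throws away the information that the lines formed
    one quoted block, which is exactly the trade-off mentioned above.
    """
    lines = [re.sub(r"^\s*>+\s*", "", line) for line in text.splitlines()]
    return " ".join(line for line in lines if line)  # use "" for Japanese
```

After this, a punctuation-based splitter can run on the result, at the cost of losing the quotation structure.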
**GiNZA**

GiNZA is a library that can be used for Japanese sentence segmentation in Python. Sentence segmentation with GiNZA can be done as follows:
```python
import spacy

nlp = spacy.load('ja_ginza')
doc = nlp('I was told, "I can\'t answer your thoughts. I want you to hit others."! Stunned,\nI had no choice but to stand there, but I still want to believe!')
for sent in doc.sents:
    print(sent)
```
```
I was told, "I can't answer your thoughts.
I want you to hit others.
"!
Stunned, I had no choice but to stand there,
but I still want to believe!
```
The advantage of GiNZA is that, because it performs full dependency parsing, it can detect sentence boundaries with high accuracy even when a line break falls in the middle of a sentence or punctuation is omitted. It is on the heavyweight side, but I think it is a good option if you are also going to use GiNZA's other features.
**sentence-splitter**

Although it is a Node.js tool, there is also sentence-splitter.
```
$ echo -e "I was told, \"I can't answer your thoughts. I want you to hit others.\"! Stunned,\nI had no choice but to stand there, but I still want to believe!" | sentence-splitter
```
```
Sentence 0: I was told, "I can't answer your thoughts. I want you to hit others."!
Sentence 1: Stunned,
I had no choice but to stand there, but I still want to believe!
```
This tool uses the parser from textlint internally for its analysis, so it also segments accurately when a line break falls in the middle of a sentence. I also like how it handles quotation brackets (「」 and the like), and its processing is quite fast, which is attractive. (If it weren't written in Node.js, I would have adopted it.)
**Pragmatic Segmenter**

Although it is a Ruby library, there is the Pragmatic Segmenter. It is a rule-based sentence segmentation library whose major advantage is that it supports **multiple languages**. Because it does not perform any complicated analysis, it is also fast, which is attractive.
Its Japanese segmentation rules are close to my taste, so the goal of this tool's development became "to segment Japanese sentences as well as or better than the Pragmatic Segmenter."
A Live Demo is available for the Pragmatic Segmenter, and the result of trying our example there is shown below.
```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<wrapper>
<s>I was told, "I can't answer your thoughts. I want you to hit others."!</s>
<s>Stunned,</s>
<s>I had no choice but to stand there, but I still want to believe!</s>
</wrapper>
```
By the way, a Python port of the Pragmatic Segmenter is being developed as pySBD. Unfortunately, the Japanese rules do not seem to have been ported yet.
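For reference, here is what pySBD usage looks like for English, based on its documented API (Japanese, as noted, is not available yet):

```python
import pysbd

# pySBD provides per-language rule-based segmenters
seg = pysbd.Segmenter(language="en", clean=False)
print(seg.segment("My name is Jonas E. Smith. Please turn to p. 55."))
# => ['My name is Jonas E. Smith.', 'Please turn to p. 55.']
```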
So, the library I made this time is published here: https://github.com/wwwcojp/ja_sentence_segmenter
In creating the library, I set myself several development goals.
It is published on PyPI (https://pypi.org/project/ja-sentence-segmenter/), so you can install it easily with pip. It supports Python 3.6 and above, and so far it has no dependencies.
```
$ pip install ja-sentence-segmenter
```
```python
import functools

from ja_sentence_segmenter.common.pipeline import make_pipeline
from ja_sentence_segmenter.concatenate.simple_concatenator import concatenate_matching
from ja_sentence_segmenter.normalize.neologd_normalizer import normalize
from ja_sentence_segmenter.split.simple_splitter import split_newline, split_punctuation

# Split on 。!? and re-join a broken line with the next one
# when the former line ends with the particle て
split_punc2 = functools.partial(split_punctuation, punctuations=r"。!?")
concat_tail_te = functools.partial(
    concatenate_matching,
    former_matching_rule=r"^(?P<result>.+)(て)$",
    remove_former_matched=False,
)
segmenter = make_pipeline(normalize, split_newline, concat_tail_te, split_punc2)

text1 = """
I was told, "I can't answer your thoughts. I want you to hit others."! Stunned,
I had no choice but to stand there, but I still want to believe!
"""

print(list(segmenter(text1)))
```
```
['I was told, "I can\'t answer your thoughts. I want you to hit others."!', 'Stunned, I had no choice but to stand there, but I still want to believe!']
```
```python
import functools

from ja_sentence_segmenter.common.pipeline import make_pipeline
from ja_sentence_segmenter.concatenate.simple_concatenator import concatenate_matching
from ja_sentence_segmenter.normalize.neologd_normalizer import normalize
from ja_sentence_segmenter.split.simple_splitter import split_newline, split_punctuation

split_punc2 = functools.partial(split_punctuation, punctuations=r"。!?")
# Re-join consecutive quoted lines (leading ">"), keeping the quote marker
# on the first line and stripping it from the following lines
concat_mail_quote = functools.partial(
    concatenate_matching,
    former_matching_rule=r"^(\s*[>]+\s*)(?P<result>.+)$",
    latter_matching_rule=r"^(\s*[>]+\s*)(?P<result>.+)$",
    remove_former_matched=False,
    remove_latter_matched=True,
)
segmenter = make_pipeline(normalize, split_newline, concat_mail_quote, split_punc2)

text2 = """
>>I was planning to go to the barber shop tomorrow, but he said, "I will change
>>my schedule in a hurry. Please let me change
>>the schedule of the meeting."
I've acknowledged.
"""

print(list(segmenter(text2)))
```
```
['>>I was planning to go to the barber shop tomorrow, but he said, "I will change my schedule in a hurry. Please let me change the schedule of the meeting."', "I've acknowledged."]
```
I'm almost out of steam, so I'll finish by listing the remaining issues.
~~Why did I write such a plain article for the Advent Calendar...~~