[Java] Cleansing Japanese sentences

Overview

When a Japanese sentence is passed to a morphological analyzer, spaces, punctuation marks, HTML tags, and similar noise in the sentence interfere with the analysis. Cleansing is therefore applied first, so that only the information needed for morphological analysis remains.

Execution environment

OS: Windows 7
Language: Java

Table of contents

1. HTML tag removal
2. Unification of full-width and half-width
3. Remove symbols and spaces
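The three steps listed above can be chained into a single method. The following is only a minimal sketch: the class name `SentenceCleanser`, the order of the steps, and the reduced symbol set are assumptions made for illustration, not something the original samples prescribe.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper that chains the three cleansing steps.
class SentenceCleanser {
    // Step 1: anything between "<" and the nearest ">" is treated as a tag.
    private static final Pattern TAG = Pattern.compile("<.*?>");
    // Step 3: ASCII punctuation and whitespace, plus U+3001 (ideographic comma),
    // U+3002 (ideographic full stop) and U+30FB (katakana middle dot).
    // This reduced set is an assumption; extend it as needed.
    private static final Pattern SYMBOLS =
            Pattern.compile("[\\p{Punct}\\s\u3001\u3002\u30FB]+");

    static String cleanse(String sentence) {
        // 1. Replace each tag with the word break symbol "/".
        String noTags = TAG.matcher(sentence).replaceAll("/");
        // 2. Unify full-width and half-width characters.
        String unified = Normalizer.normalize(noTags, Normalizer.Form.NFKC);
        // 3. Replace each run of symbols and spaces with a single "/".
        return SYMBOLS.matcher(unified).replaceAll("/");
    }

    public static void main(String[] args) {
        System.out.println(cleanse("<td>Volume 1</td><td>Hello, world!</td>"));
        // prints: /Volume/1/Hello/world/
    }
}
```

Because "/" is itself matched by `\p{Punct}`, the break symbols inserted in step 1 are merged with neighboring symbols in step 3, so consecutive "/" characters collapse into one.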

1. HTML tag removal

<td><tr>

If HTML tags remain, their contents ("td", "tr", and so on) may be recognized as words, so delete them. Simply joining the text on either side of a deleted tag can produce words that were never in the original, so replace each tag with a word break symbol (here, "/").

sample


Pattern.compile("<.*?>").matcher(sentence).replaceAll("/");

Example: <tr><td>Volume 1</td><td>About the motion of an object</td> ⇒ //Volume 1//About the motion of an object/
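As a runnable illustration of the tag replacement, here is a minimal sketch. The class name `HtmlTagCleaner` and the method name `removeTags` are hypothetical; only the regular expression comes from the sample above.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the tag-removal sample.
class HtmlTagCleaner {
    // Reluctant match: "<", as few characters as possible, then ">".
    private static final Pattern TAG = Pattern.compile("<.*?>");

    static String removeTags(String sentence) {
        // Each tag becomes one "/" so neighboring text is not fused together.
        return TAG.matcher(sentence).replaceAll("/");
    }

    public static void main(String[] args) {
        String html = "<tr><td>Volume 1</td><td>About the motion of an object</td>";
        System.out.println(removeTags(html));
        // prints: //Volume 1//About the motion of an object/
    }
}
```

Note that "." does not match line terminators by default, so a tag that spans several lines is left untouched by this pattern.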

2. Unification of full-width and half-width

ＡａbB１２３45 ｱｲｳｴｵカキクケコ

Sentences that mix full-width and half-width alphabetic characters, full-width numbers, and half-width katakana are poor both as data and in appearance, so unify them.

sample


Normalizer.normalize(sentence, Normalizer.Form.NFKC);

java.text.Normalizer, available since Java 6, is convenient. The call above performs the following conversions:
・ Full-width alphabet ⇒ Half-width alphabet
・ Full-width numbers ⇒ Half-width numbers
・ Half-width katakana ⇒ Full-width katakana
(Symbols are also converted from full-width to half-width.)

Example: ＡａbB１２３45 ｱｲカキ ⇒ AabB12345 アイカキ
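The conversion can be checked with a small program. This is a minimal sketch; the class name `WidthNormalizer` and the method name `unifyWidth` are hypothetical, and the input is written with Unicode escapes so that the source file stays ASCII-only regardless of the compiler's encoding setting.

```java
import java.text.Normalizer;

// Hypothetical wrapper around the NFKC normalization sample.
class WidthNormalizer {
    static String unifyWidth(String sentence) {
        // NFKC applies compatibility decomposition followed by composition.
        return Normalizer.normalize(sentence, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Full-width "A", "a", "1", "2", an ASCII space,
        // then half-width katakana "A" and "I".
        String mixed = "\uFF21\uFF41\uFF11\uFF12 \uFF71\uFF72";
        // Expected result: ASCII "Aa12", a space, full-width katakana U+30A2 U+30A4.
        System.out.println(unifyWidth(mixed).equals("Aa12 \u30A2\u30A4"));
        // prints: true
    }
}
```

Half-width katakana followed by a half-width voiced sound mark (for example U+FF76 U+FF9E) is also composed into the single full-width character (U+30AC).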

3. Remove symbols and spaces

What should I do today? ..

A space left in the sentence is not recognized as a word break during morphological analysis. Frequently appearing symbols such as "." tend to become noise in the analysis. As with tag processing, joining the text on either side of a deleted character can produce words that were never in the original, so replace each run of symbols and spaces with a word break symbol.

sample


Pattern.compile("[\\p{Punct}\\s　！”＃＄％＆’（）＝～｜‘｛＋＊｝＜＞？＿－＾￥＠「；：」、。・]+").matcher(sentence).replaceAll("/");

Example: What should I do today? .. ⇒ Today / what to do /
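The replacement can be tried out with the following minimal sketch. The class name `SymbolCleaner`, the method name `removeSymbols`, and the reduced set of Japanese characters in the pattern are assumptions chosen for illustration; a real character class would likely list more symbols.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the symbol-removal sample.
class SymbolCleaner {
    // \p{Punct} covers ASCII punctuation and \s covers ASCII whitespace.
    // The escapes add a few Japanese characters (an assumed, reduced set):
    // U+3000 ideographic space, U+3001 ideographic comma,
    // U+3002 ideographic full stop, U+30FB katakana middle dot,
    // U+FF01 and U+FF1F full-width exclamation and question marks.
    private static final Pattern SYMBOLS =
            Pattern.compile("[\\p{Punct}\\s\u3000\u3001\u3002\u30FB\uFF01\uFF1F]+");

    static String removeSymbols(String sentence) {
        // The trailing "+" turns each run of symbols and spaces into one "/".
        return SYMBOLS.matcher(sentence).replaceAll("/");
    }

    public static void main(String[] args) {
        System.out.println(removeSymbols("Hello, world!  How are you?"));
        // prints: Hello/world/How/are/you/
    }
}
```

English text is split at every space, whereas a Japanese sentence usually contains few spaces, so far fewer break symbols are inserted in practice.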

Reference:
https://www.slideshare.net/tsudaa/ss-36658329
http://qiita.com/kasei-san/items/3ce2249f0a1c1af1cbd2
