When you run morphological analysis on Japanese text, spaces, punctuation marks, HTML tags, and similar noise in the text interfere with the analysis. A cleansing step is therefore applied first, so that only the information needed for morphological analysis remains.
OS: Windows 7
Language: Java
1. Remove HTML tags
<td><tr>
If HTML tags are left in, their contents ("td", "tr", and so on) may be recognized as words, so remove them. Simply joining the text on either side of a removed tag can produce strange words that were never in the original, so insert a word-break symbol (here, "/") in place of each tag.
sample
// requires: import java.util.regex.Pattern;
sentence = Pattern.compile("<.*?>").matcher(sentence).replaceAll("/");
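As a self-contained sketch of this step, the one-liner above can be wrapped in a small class. The names TagStripper and stripTags are illustrative only and do not come from the original post; the pattern is the same non-greedy "<.*?>". Note that "." does not match line breaks by default, so a tag that spans several lines would be left in place by this sketch.

```java
import java.util.regex.Pattern;

class TagStripper {
    // Non-greedy match of everything between '<' and '>'.
    // This is a simple text filter, not a real HTML parser.
    private static final Pattern TAG = Pattern.compile("<.*?>");

    // Replaces every tag with the word-break symbol "/" (hypothetical helper name).
    static String stripTags(String sentence) {
        return TAG.matcher(sentence).replaceAll("/");
    }

    public static void main(String[] args) {
        // prints /Volume 1//Motion/
        System.out.println(stripTags("<td>Volume 1</td><td>Motion</td>"));
    }
}
```

Compiling the pattern once into a constant avoids recompiling it for every sentence, which matters when cleansing a large corpus.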
Example: <tr><td>Volume 1</td><td>About the motion of an object</td> ⇒ //Volume 1//About the motion of an object/
2. Unification of full-width and half-width
ＡaｂB１２345 ｱｲｳｴｵカキクケコ
Text that mixes full-width and half-width letters, full-width digits, and half-width katakana is poor both as data and in appearance, so unify the widths.
sample
// requires: import java.text.Normalizer;
sentence = Normalizer.normalize(sentence, Normalizer.Form.NFKC);
java.text.Normalizer, available since Java 6, is convenient here. The call above converts:
- Full-width letters ⇒ half-width letters
- Full-width digits ⇒ half-width digits
- Half-width katakana ⇒ full-width katakana
(Symbols are also converted from full-width to half-width.)
Example: ＡaｂB１２345 ｱｲカキ ⇒ AabB12345 アイカキ
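A minimal runnable sketch of the width unification follows. The class and method names (WidthNormalizer, unifyWidth) are hypothetical, and the sample characters are written as Unicode escapes on the assumption that the source file encoding may not be UTF-8 (for example, the platform default on Windows).

```java
import java.text.Normalizer;

class WidthNormalizer {
    // NFKC folds compatibility variants: full-width ASCII becomes half-width,
    // and half-width katakana becomes full-width katakana.
    static String unifyWidth(String sentence) {
        return Normalizer.normalize(sentence, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Full-width 'A' and '1', then half-width katakana A and I.
        String mixed = "\uFF21\uFF11\uFF71\uFF72";
        // Result is "A1" followed by full-width katakana U+30A2 U+30A4.
        System.out.println(unifyWidth(mixed));

        // Half-width KA plus half-width voiced mark composes to a single GA (U+30AC).
        System.out.println(unifyWidth("\uFF76\uFF9E").equals("\u30AC"));
    }
}
```

Because NFKC also composes a half-width kana with its voiced sound mark into one character, the string length can change, so any character offsets computed before normalization should not be reused afterwards.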
3. Remove symbols and spaces
What should I do today? ..
A space left in the text is not recognized as a word break during morphological analysis. Frequently occurring symbols such as "." also tend to turn into noise in the analysis results. As with tag removal, simply joining the text on either side of a removed symbol can produce strange words, so insert a word-break symbol in its place.
sample
// requires: import java.util.regex.Pattern;
sentence = Pattern.compile("[\\p{Punct}\\s\\u3000！”＃＄％＆’（）＝～｜‘｛＋＊｝＜＞？＿－＾￥＠「；：」、。・]+").matcher(sentence).replaceAll("/");
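Example: What should I do today? .. ⇒ Today/what should I do/
(The sample is a Japanese sentence of the form "Today[space]what should I do?..", so the single space and the run of trailing symbols each become one "/".)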
- \\p{Punct} matches any one ASCII punctuation character: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
- Symbols are a deep topic; this approach only covers the symbols that commonly appear in ordinary sentences.
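The three steps can be chained into one pass, sketched below. The class name TextCleanser, the method name cleanse, and the order (tags, then width unification, then symbols) are assumptions for illustration. Because NFKC has already folded full-width symbols and the ideographic space into their ASCII forms by the time the symbol pattern runs, this sketch uses a shorter character class than the sample above, listing only a few Japanese punctuation marks that NFKC leaves unchanged.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class TextCleanser {
    // Step 1: anything between '<' and '>' (non-greedy).
    private static final Pattern TAG = Pattern.compile("<.*?>");

    // Step 3: ASCII punctuation, whitespace, and the Japanese marks
    // U+3001, U+3002 (comma, full stop), U+300C, U+300D (corner brackets),
    // U+30FB (middle dot). Extend this set for your own data.
    private static final Pattern SYMBOLS =
            Pattern.compile("[\\p{Punct}\\s\u3001\u3002\u300C\u300D\u30FB]+");

    static String cleanse(String sentence) {
        String s = TAG.matcher(sentence).replaceAll("/");       // 1. tags -> "/"
        s = Normalizer.normalize(s, Normalizer.Form.NFKC);      // 2. unify widths
        return SYMBOLS.matcher(s).replaceAll("/");              // 3. symbol runs -> "/"
    }

    public static void main(String[] args) {
        // <p>, full-width ABC, ideographic space, full-width 123, full-width ! and ?, </p>
        String raw = "<p>\uFF21\uFF22\uFF23\u3000\uFF11\uFF12\uFF13\uFF01\uFF1F</p>";
        System.out.println(cleanse(raw)); // prints /ABC/123/
    }
}
```

Since "/" itself belongs to \p{Punct}, the separators inserted in step 1 merge with any neighbouring symbols in step 3, so runs such as "//" collapse to a single "/".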
Reference: https://www.slideshare.net/tsudaa/ss-36658329 http://qiita.com/kasei-san/items/3ce2249f0a1c1af1cbd2