Many Japanese natural language processing libraries stop with an error the moment Hebrew or Korean text shows up in the input. Here are a few spells that are useful in such cases.
janome is a wonderful morphological analyzer that saves you the trouble of installing MeCab, but if even one non-Japanese character is mixed into the input, it dies with an error. For example, reading the language-switching bar on the left side of Wikipedia:
# The text from the language bar on the left of Wikipedia
from janome.tokenizer import Tokenizer

text = "他言語版 Italiano 한국어 Polski Simple English"
t = Tokenizer()
for token in t.tokenize(text):
    print(token)
---------------
Traceback (most recent call last):
File "tests.py", line 98, in <module>
for token in t.tokenize(text):
File "lib/python2.7/site-packages/janome/tokenizer.py", line 107, in tokenize
pos += lattice.forward()
File "lib/python2.7/site-packages/janome/lattice.py", line 124, in forward
while not self.enodes[self.p]:
IndexError: list index out of range
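Before filtering anything, it can help to see exactly which characters the tokenizer is choking on. The sketch below is my own diagnostic (not part of janome's API) and assumes that hiragana, katakana, and the common CJK ideograph block are what counts as "Japanese"; it lists every character outside those ranges.

import re

# Hiragana, katakana, and the common CJK ideograph block -- an assumption about
# which ranges count as "Japanese" here, matching the filter shown below.
jp_char = re.compile(u'[ぁ-んァ-ン\u4e00-\u9FFF]')

def non_japanese_chars(text):
    return [ch for ch in text if not jp_char.match(ch)]

# For the Wikipedia language-bar string this returns the spaces, the Latin
# letters, and the hangul characters -- exactly the input janome chokes on.
print(non_japanese_chars(u"他言語版 Italiano 한국어 Polski Simple English"))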
The fix is to strip everything that is not Japanese out of the text before handing it to the tokenizer:
import re
import nltk

def filter(text):
    """
    :param text: str
    :rtype: str
    """
    # Remove alphabets, half-width alphanumerics, symbols, line breaks, and tabs
    text = re.sub(r'[a-zA-Z0-9\"\.\,\@]+', '', text)
    text = re.sub(r'[!"“#$%&()\*\+\-\.,\/:;<=>?@\[\\\]^_`{|}~]', '', text)
    text = re.sub(r'[\n\r\t]', '', text)
    # Remove non-Japanese characters (Korean, Chinese, Hebrew, etc.) by keeping
    # only runs of hiragana, katakana, and CJK ideographs
    jp_chartype_tokenizer = nltk.RegexpTokenizer(u'([ぁ-ん]+|[ァ-ン]+|[\u4e00-\u9FFF]+|[ぁ-んァ-ン\u4e00-\u9FFF]+)')
    text = "".join(jp_chartype_tokenizer.tokenize(text))
    return text
text = "Other language version Italiano한 국어 Polski Simple English"
text = filter(text)
t = Tokenizer()
for token in t.tokenize(text):
print token
------------------
他	接頭詞,名詞接続,*,*,*,*,他,タ,タ
言語	名詞,一般,*,*,*,*,言語,ゲンゴ,ゲンゴ
版	名詞,接尾,一般,*,*,*,版,バン,バン
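If you would rather not pull in nltk just for this, the same idea can be written with re alone. This is a minimal sketch under the same assumption about which Unicode ranges count as Japanese; the helper name keep_japanese is my own.

import re

def keep_japanese(text):
    # Keep only runs of hiragana, katakana, and CJK ideographs, dropping
    # everything else (Latin letters, hangul, Hebrew, punctuation, spaces, ...)
    return "".join(re.findall(u'[ぁ-んァ-ン\u4e00-\u9FFF]+', text))

text = keep_japanese(u"他言語版 Italiano 한국어 Polski Simple English")
# text is now u"他言語版", which janome tokenizes without crashing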