Create a function that easily normalizes inconsistent use of half-width and full-width characters.
First, prepare the characters before and after the conversion.
abc_half = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
abc_full = "ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ"
digit_half = "0123456789"
digit_full = "０１２３４５６７８９"
katakana_half = "ｱｲｳｴｵｶｷｸｹｺｻｼｽｾｿﾀﾁﾂﾃﾄﾅﾆﾇﾈﾉﾊﾋﾌﾍﾎﾏﾐﾑﾒﾓﾔﾕﾖﾗﾘﾙﾚﾛﾜｦﾝ"
katakana_full = "アイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワヲン"
punc_half = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
punc_full = "！＂＃＄％＆＇（）＊＋，－．／：；＜＝＞？＠［＼］＾＿｀｛｜｝～"
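The full-width strings do not have to be typed by hand. As a minimal sketch, relying only on the fact that the full-width forms of printable ASCII occupy U+FF01 to U+FF5E, each exactly 0xFEE0 above its ASCII counterpart, the same pairs could be generated. The helper name `to_full_width` is hypothetical and not part of the original post.

```python
import string

FULL_WIDTH_OFFSET = 0xFEE0  # U+FF01..U+FF5E mirror ASCII U+0021..U+007E


def to_full_width(half: str) -> str:
    """Shift printable ASCII (except space) to its full-width form.

    Hypothetical helper for illustration; other characters pass through.
    """
    return "".join(
        chr(ord(ch) + FULL_WIDTH_OFFSET) if "!" <= ch <= "~" else ch
        for ch in half
    )


abc_half = string.ascii_letters   # a-z followed by A-Z
digit_half = string.digits
punc_half = string.punctuation

abc_full = to_full_width(abc_half)
digit_full = to_full_width(digit_half)
punc_full = to_full_width(punc_half)

print(digit_full)
```

Generating the strings this way guarantees that each half-width string and its full-width partner have the same length and order, which `str.maketrans` requires.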
Half-width katakana writes a voiced (dakuten) or semi-voiced (handakuten) sound as two characters, the base kana followed by ﾞ or ﾟ, so create a conversion table for these separately from the others.
tmp01 = "ｶﾞｷﾞｸﾞｹﾞｺﾞｻﾞｼﾞｽﾞｾﾞｿﾞﾀﾞﾁﾞﾂﾞﾃﾞﾄﾞﾊﾞﾋﾞﾌﾞﾍﾞﾎﾞﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ"
tmp02 = "ガギグゲゴザジズゼゾダヂヅデドバビブベボパピプペポ"
transtable02 = {}
for i in range(len(tmp02)):
    be = tmp01[i*2:i*2+2]  # two-character half-width sequence
    af = tmp02[i]          # single full-width character
    transtable02[be] = af
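To see why these sounds need their own table, a quick check with the standard library (a sketch, not part of the original post) shows that a half-width voiced kana is two code points, a base letter followed by the half-width voiced sound mark U+FF9E, while its full-width counterpart is a single code point.

```python
import unicodedata

half = "\uFF76\uFF9E"  # half-width KA followed by the half-width voiced sound mark
full = "\u30AC"        # full-width GA, a single code point

print(len(half), len(full))
for ch in half:
    # Print each code point with its official Unicode name
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```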
Inside the function clean_text, transtable01 = str.maketrans(before, after) creates a translation table, and text = text.translate(transtable01) applies it.
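As a small illustration of those two calls (a sketch with a deliberately tiny table, not the full one used in this post): `str.maketrans` pairs up the characters at the same position in its two arguments, which must be the same length, and `translate` then replaces each character independently.

```python
before = "０１２３４５６７８９"  # full-width digits
after = "0123456789"           # half-width digits, same length and order

# The table is a dict mapping code points to code points
table = str.maketrans(before, after)

# Characters missing from the table (here the kanji) are left untouched
converted = "２０２１年".translate(table)
print(converted)
```

Because the lookup is per character, a table built this way cannot match a two-character sequence such as a half-width kana plus its sound mark.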
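def clean_text(text):
    text = str(text).replace("\u3000", " ")  # full-width space to half-width
    # Replace the two-character voiced/semi-voiced kana first:
    # str.translate only maps single characters, and the table below
    # would otherwise convert the base kana and strand the sound mark.
    for be, af in transtable02.items():
        text = text.replace(be, af)
    before = abc_full + digit_full + katakana_half + punc_full
    after = abc_half + digit_half + katakana_full + punc_half
    transtable01 = str.maketrans(before, after)
    text = text.translate(transtable01)
    return text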
text = "Memo Nara Rirure,-. / :; qrgegozajizezodaji"
clean_text(text)
>>>Memo Yayuyora Rirure+,-./:qr Gegozajizuzezodaji
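For comparison, Python's standard `unicodedata.normalize` with the `NFKC` form performs a broadly similar folding. This is a different technique from the hand-built tables in this post, shown only as a cross-check. It also rewrites other compatibility characters (circled digits, for example), which may be more than is wanted.

```python
import unicodedata

# Half-width GA and GI (each a base kana plus sound mark), a full-width
# space, then full-width "ABC123"
sample = "\uFF76\uFF9E\uFF77\uFF9E\u3000\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13"

normalized = unicodedata.normalize("NFKC", sample)
print(normalized)
```

The hand-built tables remain useful when only a chosen subset of characters should be converted.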
That's all!
Japanese has other kinds of spelling variation as well, such as okurigana and kanji numerals, so I hope to add support for more of them.