This is a Twitter bot that extracts passages from Wikipedia text that happen to form a 5-7-5-7-7 pattern and posts them. The original was created in 2015 and appears to have been built with Ruby + MeCab, so this time I reimplement it with Python and Sudachi. Thanks and respect to my predecessors.
Windows 10 Home (64-bit)
Python 3.7.4
SudachiDict-full==20191224
SudachiPy==0.4.2
The environment was built with pipenv.
Sudachi lets you specify a split mode, so text can be tokenized in the longest possible units. With MeCab, words are broken down so finely that you can pick up strange splits such as "Tokyo Sky" ending one phrase and "tree" starting the next. Sudachi reduces this risk.
The following is from Sudachi's GitHub:

Split modes: Sudachi offers three split modes, A, B, and C, from shortest to longest. A is equivalent to UniDic short units, C extracts units equivalent to named entities, and B gives intermediate units between A and C.

An example is shown below.

(When using the core dictionary)

A: 選挙 / 管理 / 委員 / 会
B: 選挙 / 管理 / 委員会
C: 選挙管理委員会 (election administration committee)

A: 客室 / 乗務 / 員
B: 客室 / 乗務員
C: 客室乗務員 (cabin crew member)
This is just my impression, but the shortest mode, A, is about the same as MeCab.
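To see the difference for yourself, a quick check like the following works (same API calls as in the script below; the example word is the one from Sudachi's README):

from sudachipy import dictionary, tokenizer

tokenizer_obj = dictionary.Dictionary().create()
text = "選挙管理委員会"
# Print the tokenization under each of the three split modes
for mode in (tokenizer.Tokenizer.SplitMode.A,
             tokenizer.Tokenizer.SplitMode.B,
             tokenizer.Tokenizer.SplitMode.C):
    print(mode, [m.surface() for m in tokenizer_obj.tokenize(text, mode)])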
Install SudachiPy and a dictionary by following the SudachiPy GitHub README. There are three dictionary types: small, core, and full. This time I installed the full dictionary, which has the largest vocabulary.
pip install SudachiPy
pip install https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/SudachiDict_full-20191224.tar.gz
Regarding the dictionary, there seem to be two ways to use it: specify it from a JSON file in the source, or link it as the default dictionary from the command line. This time I chose the latter. It isn't mentioned in the documentation, but running it normally gave me a Permission error, so execute the following command **from a prompt with administrator privileges**. This was the gotcha here. (I haven't confirmed what happens on Linux or Mac.)
sudachipy link -t full
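For reference, the former (JSON) route that I didn't take seems to look roughly like this. The config_path argument is from the SudachiPy README; the contents of sudachi.json here are my assumption for illustration:

from sudachipy import dictionary

# Sketch of the JSON route (not what I used): point config_path at a
# sudachi.json whose "systemDict" entry names the dictionary file,
# e.g. sudachi.json (excerpt, assumed): { "systemDict": "system_full.dic", ... }
tokenizer_obj = dictionary.Dictionary(config_path="./sudachi.json").create()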
Since I built it loosely, there may be questionable parts in how it's put together. In fact, I'm sure there are. The modes are split so that both haiku and tanka can be detected. My first approach was to extract sentences of exactly 31 sounds and check whether they formed a 5-7-5-7-7. When I tried it, however, almost nothing was detected, so I also made it possible to search within part of a longer sentence, switched by the setting.precision flag. Use the former when you want clean results with little noise, the latter when you want quantity above all.
The basic policy is to count readings while managing a flag, and extract passages whose sound breaks fit 5-7-5-7-7 exactly. A candidate is rejected when it becomes, say, 5-7-9 or 5-7-5-7-8, and also when a particle or auxiliary verb starts a phrase; in that case the head of the text is deleted and the search runs again. Small kana such as ャ, ュ, and ョ are not counted as sounds. I use importlib because I eventually turned this into an exe with pyinstaller. (The phrase boundaries are held as cumulative sound counts, as the sketch below shows.)
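Just to illustrate (my own note, not part of the bot), the [5, 12, 17, 24, 31] list in the script is simply the running total of the 5-7-5-7-7 phrase lengths:

# Cumulative sound counts for the 5-7-5-7-7 phrases
phrase_lengths = [5, 7, 5, 7, 7]
break_points = []
total = 0
for length in phrase_lengths:
    total += length
    break_points.append(total)
print(break_points)  # [5, 12, 17, 24, 31]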
search_tanka.py
import re
import importlib

from sudachipy import tokenizer
from sudachipy import dictionary

setting = importlib.import_module('setting')

tokenizer_obj = dictionary.Dictionary().create()

# Set the split unit to the longest (mode C)
split_mode = tokenizer.Tokenizer.SplitMode.C

# Path of the text file to search
searchfile_path = "./search_text/" + setting.search_file_name

# Path of the text file to write results to
savefile_path = "./result_text/" + setting.save_file_name

# Switch between haiku and tanka mode
if setting.mode == 1:
    break_points = [5, 12, 17]
else:
    break_points = [5, 12, 17, 24, 31]

# Regular expression matching katakana
re_katakana = re.compile(r'[\u30A1-\u30F4]+')

# Open the text file
with open(searchfile_path, encoding="utf-8_sig") as f:
    # Read all lines into a list
    areas = f.readlines()
    for line in areas:
        # Split on "。", ".", or line breaks
        sentences = re.split('[.。\n]', line)
        for sentence in sentences:
            # When searching inside long sentences, keep every sentence
            if setting.precision == 1:
                pass
            # Otherwise skip sentences longer than a haiku/tanka
            else:
                if len(sentence) > break_points[-1]:
                    continue
            # Morphological analysis
            m = tokenizer_obj.tokenize(sentence, split_mode)
            # Cast the MorphemeList to a list
            m = list(m)
            retry = True
            while retry:
                break_point_header_flag = True
                retry = False
                counter = 0
                break_point_index = 0
                reading = ""
                surface = ""
                # Check whether the text breaks cleanly at each phrase boundary
                for mm in m:
                    if break_point_header_flag == True:
                        text_type = mm.part_of_speech()[0]
                        # Reject if a phrase starts with an unsuitable part of speech
                        if text_type in setting.skip_text_type:
                            # If long-sentence search is on, retry from the next morpheme
                            if setting.precision == 1:
                                retry = True
                                del m[0]
                                break
                            else:
                                counter = 0
                                break
                        else:
                            break_point_header_flag = False
                    # Analyze the reading
                    reading_text = mm.reading_form()
                    surface_text = mm.surface()
                    # A single morpheme longer than 7 sounds can never fit a phrase
                    if len(reading_text) > 7:
                        # If long-sentence search is on, retry from the next morpheme
                        if setting.precision == 1:
                            retry = True
                            del m[0]
                            break
                        else:
                            counter = 0
                            break
                    # Skip characters that should be ignored
                    if reading_text in setting.skip_text:
                        sentence = sentence.replace(mm.surface(), "")
                        continue
                    # Katakana proper nouns often have no reading, so fall back to the surface form
                    if reading_text == "":
                        text_surface = mm.surface()
                        if re_katakana.fullmatch(text_surface):
                            reading_text = text_surface
                        # Reject if a character the dictionary cannot read appears
                        else:
                            # If long-sentence search is on, retry from the next morpheme
                            if setting.precision == 1:
                                retry = True
                                del m[0]
                                break
                            else:
                                counter = 0
                                break
                    # Count the sounds in the reading
                    counter += len(reading_text)
                    reading = reading + reading_text
                    surface = surface + surface_text
                    # Subtract small kana that do not count as sounds
                    for letter in setting.skip_letters:
                        if letter in reading_text:
                            counter -= reading_text.count(letter)
                    # Did the count land exactly on a phrase boundary?
                    if counter == break_points[break_point_index]:
                        break_point_header_flag = True
                        # If this is not the final boundary, move on to the next phrase
                        if counter != break_points[-1]:
                            break_point_index += 1
                            reading = reading + " "
                    # Reject when the count overshoots the phrase boundary
                    elif counter > break_points[break_point_index]:
                        # If long-sentence search is on, retry from the next morpheme
                        if setting.precision == 1:
                            retry = True
                            del m[0]
                            break
                        else:
                            counter = 0
                            break
                # Keep only exact matches for the full sound count and append them to the file
                if counter == break_points[-1]:
                    with open(savefile_path, "a") as f:
                        try:
                            print(surface + " ")
                            print("(" + reading + ")" + "\n")
                            f.write(surface + "\n")
                            f.write("(" + reading + ")" + "\n")
                            f.write("\n")
                        except Exception as e:
                            print(e)
                # Stop retrying when too few morphemes remain
                if len(m) < len(break_points):
                    break
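Incidentally, loading setting.py through importlib is what lets the settings file sit next to the frozen executable; the pyinstaller build itself would be the usual one-file invocation (this command line is my assumption of the setup, not from the original post):

pyinstaller --onefile search_tanka.py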
setting.py
# mode (detection mode) 1: haiku, 2: tanka
mode = 2
# precision 1: low, 2: high
#   Low:  many detections, much noise, long execution time
#   High: few detections, little noise, short execution time
precision = 1
# Small kana that are not counted as sounds
skip_letters = ['ャ', 'ュ', 'ョ']
# File to search
search_file_name = "jawiki-latest-pages-articles.xml-001.txt"
# File the detection results are saved to
save_file_name = "result.txt"
# Parts of speech that must not start a phrase
# (particle, auxiliary verb, suffix, supplementary symbol)
skip_text_type = ["助詞", "助動詞", "接尾辞", "補助記号"]
# Characters excluded from analysis ("キゴウ" is the reading given to symbols)
skip_text = ["、", "キゴウ", "=", "・"]
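As a toy illustration (mine, not part of the bot) of how skip_letters interacts with the counting in search_tanka.py:

# Count sounds the same way search_tanka.py does:
# small kana such as ャ/ュ/ョ are subtracted from the character count.
skip_letters = ['ャ', 'ュ', 'ョ']

def count_sounds(reading):
    count = len(reading)
    for letter in skip_letters:
        count -= reading.count(letter)
    return count

print(count_sounds("トウキョウ"))  # 5
print(count_sounds("キャク"))      # 2 (キャ + ク)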
I ran it over Wikipedia text. Some excerpts below.
Plato changed his style from his middle period under the influence of Isocrates (Puraton wa Isokuratesu no eikyō o ukeshi chūki yori buntai o kae)
A stone monument has been erected at the birthplace of the Shōfū style, which is said to be its beginning (Hajimari to iware Shōfū hasshō no chi no ishibumi ga taterare tari)
Special emphasis on bloodline, which is an individual tradition in transmission (Sōshō ni okeru kobetsu no denshō de ari ketsumyaku o toku ni omonji)
Nouakchott was built as the city that was to become the future capital (Shōrai no shuto ni narubeki toshi to shite Nuakushotto ga kensetsu sareta)
Japanese-style octopus fishing using octopus pots with Japan's cooperation (Nippon no kyōryoku ni yori takotsubo o tsukau Nippon-shiki no tako-ryō)
If I ever revisit this, I would like to address the following.
・Reading of numbers: with Sudachi, the readings of numbers do not come out properly at present.
Example) "It generally stayed between 50 and 150 cars." ↓ (Ōmune wa go-rei ryō kara ichi-go-rei ryō no aida de suii shite ita) — each digit is read separately, so 150 comes out as "ichi-go-rei" rather than "hyaku-gojū".
There are plans on the Sudachi side, but implementation is undecided. It seems workable either by putting in the effort to write a dictionary plugin, or by cutting out just the numbers with a regular expression and analyzing them with a different engine. (Whether that would actually yield good tanka is another story.)
・Execution speed: I expected as much, but it is slow... This is a problem with my implementation, before any question of Sudachi or Python being slow. Sorry. It honestly loops over everything with for statements, so using itertools properly should speed it up.
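If I were to try the regular-expression route, a minimal sketch might look like this (the reading table and read_number helper are my own hypothetical illustration; it handles at most four digits and ignores euphonic changes such as 300 → サンビャク):

import re

# Cut out runs of digits and build a positional reading, so that 150
# becomes "ヒャクゴジュウ" instead of the digit-by-digit "イチゴレイ".
DIGITS = ['', 'イチ', 'ニ', 'サン', 'ヨン', 'ゴ', 'ロク', 'ナナ', 'ハチ', 'キュウ']
UNITS = ['', 'ジュウ', 'ヒャク', 'セン']

def read_number(digits):
    reading = ''
    for i, d in enumerate(digits):
        n = int(d)
        unit = UNITS[len(digits) - 1 - i]
        if n == 0:
            continue
        if n == 1 and unit:
            reading += unit              # 150 -> ヒャク..., not イチヒャク...
        else:
            reading += DIGITS[n] + unit
    return reading or 'レイ'

sentence = "概ね50両から150両の間で推移していた"
for match in re.finditer(r'\d+', sentence):
    print(match.group(), '->', read_number(match.group()))
# 50 -> ゴジュウ
# 150 -> ヒャクゴジュウ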
In the end, I built it almost without looking at the original source (oops). Occasionally there was noise or something cut off in a strange place, but on the whole it worked correctly. I think the final selection still has to be done by hand. I do wonder how the original tweets from its bot and fetches the source of each quote.
Plenty of them are detected, but at the moment my favorite is the following.
**The annoying act of copy-pasting the same reply content over and over** **(Naiyō no resu o nandomo kopipe shite kakikomu to iu meiwaku kōi)**
Well then.