This is a Twitter bot that extracts passages from Wikipedia text that happen to form a 5-7-5-7-7 pattern and posts them. The original was created in 2015 and appears to have been built with Ruby + MeCab, so this time I reimplement it with Python and Sudachi. Thanks and respect to my predecessors.
Windows 10 Home (64-bit)
Python 3.7.4
SudachiDict-full==20191224
SudachiPy==0.4.2
The environment was built with pipenv.
Sudachi lets you specify a split mode, so text can be tokenized in the longest possible units. With MeCab, words are broken down so finely that you can pick up strange splits such as "Tokyo Sky" ending one phrase and "tree" starting the next. Sudachi reduces this risk.
The following is from Sudachi's GitHub:

Split modes: Sudachi offers three split modes, A, B, and C, from shortest to longest. A is equivalent to UniDic short units, C extracts units equivalent to named entities, and B gives intermediate units between A and C.

An example is shown below.

(When using the core dictionary)

A: 選挙 / 管理 / 委員 / 会
B: 選挙 / 管理 / 委員会
C: 選挙管理委員会 (election administration committee)

A: 客室 / 乗務 / 員
B: 客室 / 乗務員
C: 客室乗務員 (cabin crew member)
This is just my impression, but the shortest mode, A, is about the same as MeCab.
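To see the difference for yourself, a quick check like the following works (same API calls as in the script below; the example word is the one from Sudachi's README):

from sudachipy import dictionary, tokenizer

tokenizer_obj = dictionary.Dictionary().create()
text = "選挙管理委員会"
# Print the tokenization under each of the three split modes
for mode in (tokenizer.Tokenizer.SplitMode.A,
             tokenizer.Tokenizer.SplitMode.B,
             tokenizer.Tokenizer.SplitMode.C):
    print(mode, [m.surface() for m in tokenizer_obj.tokenize(text, mode)])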
Install SudachiPy and a dictionary by following the SudachiPy GitHub README. There are three dictionary types: small, core, and full. This time I installed the full dictionary, which has the largest vocabulary.
pip install SudachiPy
pip install https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/SudachiDict_full-20191224.tar.gz
Regarding the dictionary, there seem to be two ways to use it: specify it from a JSON file in the source, or link it as the default dictionary from the command line. This time I chose the latter. It isn't mentioned in the documentation, but running it normally gave me a Permission error, so execute the following command **from a prompt with administrator privileges**. This was the gotcha here. (I haven't confirmed what happens on Linux or Mac.)
sudachipy link -t full
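For reference, the former (JSON) route that I didn't take seems to look roughly like this. The config_path argument is from the SudachiPy README; the contents of sudachi.json here are my assumption for illustration:

from sudachipy import dictionary

# Sketch of the JSON route (not what I used): point config_path at a
# sudachi.json whose "systemDict" entry names the dictionary file,
# e.g. sudachi.json (excerpt, assumed): { "systemDict": "system_full.dic", ... }
tokenizer_obj = dictionary.Dictionary(config_path="./sudachi.json").create()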
Since I built it loosely, there may be questionable parts in how it's put together. In fact, I'm sure there are. The modes are split so that both haiku and tanka can be detected. My first approach was to extract sentences of exactly 31 sounds and check whether they formed a 5-7-5-7-7. When I tried it, however, almost nothing was detected, so I also made it possible to search within part of a longer sentence, switched by the setting.precision flag. Use the former when you want clean results with little noise, the latter when you want quantity above all.
The basic policy is to count readings while managing a flag, and extract passages whose sound breaks fit 5-7-5-7-7 exactly. A candidate is rejected when it becomes, say, 5-7-9 or 5-7-5-7-8, and also when a particle or auxiliary verb starts a phrase; in that case the head of the text is deleted and the search runs again. Small kana such as ャ, ュ, and ョ are not counted as sounds. I use importlib because I eventually turned this into an exe with pyinstaller. (The phrase boundaries are held as cumulative sound counts, as the sketch below shows.)
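Just to illustrate (my own note, not part of the bot), the [5, 12, 17, 24, 31] list in the script is simply the running total of the 5-7-5-7-7 phrase lengths:

# Cumulative sound counts for the 5-7-5-7-7 phrases
phrase_lengths = [5, 7, 5, 7, 7]
break_points = []
total = 0
for length in phrase_lengths:
    total += length
    break_points.append(total)
print(break_points)  # [5, 12, 17, 24, 31]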
search_tanka.py
import re
import importlib

from sudachipy import tokenizer
from sudachipy import dictionary

setting = importlib.import_module('setting')

tokenizer_obj = dictionary.Dictionary().create()

# Set the split unit to the longest (mode C)
split_mode = tokenizer.Tokenizer.SplitMode.C

# Path of the text file to search
searchfile_path = "./search_text/" + setting.search_file_name

# Path of the text file to write results to
savefile_path = "./result_text/" + setting.save_file_name

# Switch between haiku and tanka mode
if setting.mode == 1:
    break_points = [5, 12, 17]
else:
    break_points = [5, 12, 17, 24, 31]

# Regular expression matching katakana
re_katakana = re.compile(r'[\u30A1-\u30F4]+')

# Open the text file
with open(searchfile_path, encoding="utf-8_sig") as f:
    # Read all lines into a list
    areas = f.readlines()
    for line in areas:
        # Split on "。", ".", or line breaks
        sentences = re.split('[.。\n]', line)
        for sentence in sentences:
            # When searching inside long sentences, keep every sentence
            if setting.precision == 1:
                pass
            # Otherwise skip sentences longer than a haiku/tanka
            else:
                if len(sentence) > break_points[-1]:
                    continue
            # Morphological analysis
            m = tokenizer_obj.tokenize(sentence, split_mode)
            # Cast the MorphemeList to a list
            m = list(m)
            retry = True
            while retry:
                break_point_header_flag = True
                retry = False
                counter = 0
                break_point_index = 0
                reading = ""
                surface = ""
                # Check whether the text breaks cleanly at each phrase boundary
                for mm in m:
                    if break_point_header_flag == True:
                        text_type = mm.part_of_speech()[0]
                        # Reject if a phrase starts with an unsuitable part of speech
                        if text_type in setting.skip_text_type:
                            # If long-sentence search is on, retry from the next morpheme
                            if setting.precision == 1:
                                retry = True
                                del m[0]
                                break
                            else:
                                counter = 0
                                break
                        else:
                            break_point_header_flag = False
                    # Analyze the reading
                    reading_text = mm.reading_form()
                    surface_text = mm.surface()
                    # A single morpheme longer than 7 sounds can never fit a phrase
                    if len(reading_text) > 7:
                        # If long-sentence search is on, retry from the next morpheme
                        if setting.precision == 1:
                            retry = True
                            del m[0]
                            break
                        else:
                            counter = 0
                            break
                    # Skip characters that should be ignored
                    if reading_text in setting.skip_text:
                        sentence = sentence.replace(mm.surface(), "")
                        continue
                    # Katakana proper nouns often have no reading, so fall back to the surface form
                    if reading_text == "":
                        text_surface = mm.surface()
                        if re_katakana.fullmatch(text_surface):
                            reading_text = text_surface
                        # Reject if a character the dictionary cannot read appears
                        else:
                            # If long-sentence search is on, retry from the next morpheme
                            if setting.precision == 1:
                                retry = True
                                del m[0]
                                break
                            else:
                                counter = 0
                                break
                    # Count the sounds in the reading
                    counter += len(reading_text)
                    reading = reading + reading_text
                    surface = surface + surface_text
                    # Subtract small kana that do not count as sounds
                    for letter in setting.skip_letters:
                        if letter in reading_text:
                            counter -= reading_text.count(letter)
                    # Did the count land exactly on a phrase boundary?
                    if counter == break_points[break_point_index]:
                        break_point_header_flag = True
                        # If this is not the final boundary, move on to the next phrase
                        if counter != break_points[-1]:
                            break_point_index += 1
                            reading = reading + " "
                    # Reject when the count overshoots the phrase boundary
                    elif counter > break_points[break_point_index]:
                        # If long-sentence search is on, retry from the next morpheme
                        if setting.precision == 1:
                            retry = True
                            del m[0]
                            break
                        else:
                            counter = 0
                            break
                # Keep only exact matches for the full sound count and append them to the file
                if counter == break_points[-1]:
                    with open(savefile_path, "a") as f:
                        try:
                            print(surface + " ")
                            print("(" + reading + ")" + "\n")
                            f.write(surface + "\n")
                            f.write("(" + reading + ")" + "\n")
                            f.write("\n")
                        except Exception as e:
                            print(e)
                # Stop retrying when too few morphemes remain
                if len(m) < len(break_points):
                    break
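Incidentally, loading setting.py through importlib is what lets the settings file sit next to the frozen executable; the pyinstaller build itself would be the usual one-file invocation (this command line is my assumption of the setup, not from the original post):

pyinstaller --onefile search_tanka.py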
setting.py
# mode (detection mode) 1: haiku, 2: tanka
mode = 2
# precision 1: low, 2: high
#   Low:  many detections, much noise, long execution time
#   High: few detections, little noise, short execution time
precision = 1
# Small kana that are not counted as sounds
skip_letters = ['ャ', 'ュ', 'ョ']
# File to search
search_file_name = "jawiki-latest-pages-articles.xml-001.txt"
# File the detection results are saved to
save_file_name = "result.txt"
# Parts of speech that must not start a phrase
# (particle, auxiliary verb, suffix, supplementary symbol)
skip_text_type = ["助詞", "助動詞", "接尾辞", "補助記号"]
# Characters excluded from analysis ("キゴウ" is the reading given to symbols)
skip_text = ["、", "キゴウ", "=", "・"]
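As a toy illustration (mine, not part of the bot) of how skip_letters interacts with the counting in search_tanka.py:

# Count sounds the same way search_tanka.py does:
# small kana such as ャ/ュ/ョ are subtracted from the character count.
skip_letters = ['ャ', 'ュ', 'ョ']

def count_sounds(reading):
    count = len(reading)
    for letter in skip_letters:
        count -= reading.count(letter)
    return count

print(count_sounds("トウキョウ"))  # 5
print(count_sounds("キャク"))      # 2 (キャ + ク)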
I ran it over Wikipedia text. Some excerpts below.
Plato changed his style from his middle period under the influence of Isocrates (Puraton wa Isokuratesu no eikyō o ukeshi chūki yori buntai o kae)
A stone monument has been erected at the birthplace of the Shōfū style, which is said to be its beginning (Hajimari to iware Shōfū hasshō no chi no ishibumi ga taterare tari)
Special emphasis on bloodline, which is an individual tradition in transmission (Sōshō ni okeru kobetsu no denshō de ari ketsumyaku o toku ni omonji)
Nouakchott was built as the city that was to become the future capital (Shōrai no shuto ni narubeki toshi to shite Nuakushotto ga kensetsu sareta)
Japanese-style octopus fishing using octopus pots with Japan's cooperation (Nippon no kyōryoku ni yori takotsubo o tsukau Nippon-shiki no tako-ryō)
If I ever revisit this, I would like to address the following.
・Reading of numbers: with Sudachi, the readings of numbers do not come out properly at present.
Example) "It generally stayed between 50 and 150 cars." ↓ (Ōmune wa go-rei ryō kara ichi-go-rei ryō no aida de suii shite ita) — each digit is read separately, so 150 comes out as "ichi-go-rei" rather than "hyaku-gojū".
There are plans on the Sudachi side, but implementation is undecided. It seems workable either by putting in the effort to write a dictionary plugin, or by cutting out just the numbers with a regular expression and analyzing them with a different engine. (Whether that would actually yield good tanka is another story.)
・Execution speed: I expected as much, but it is slow... This is a problem with my implementation, before any question of Sudachi or Python being slow. Sorry. It honestly loops over everything with for statements, so using itertools properly should speed it up.
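If I were to try the regular-expression route, a minimal sketch might look like this (the reading table and read_number helper are my own hypothetical illustration; it handles at most four digits and ignores euphonic changes such as 300 → サンビャク):

import re

# Cut out runs of digits and build a positional reading, so that 150
# becomes "ヒャクゴジュウ" instead of the digit-by-digit "イチゴレイ".
DIGITS = ['', 'イチ', 'ニ', 'サン', 'ヨン', 'ゴ', 'ロク', 'ナナ', 'ハチ', 'キュウ']
UNITS = ['', 'ジュウ', 'ヒャク', 'セン']

def read_number(digits):
    reading = ''
    for i, d in enumerate(digits):
        n = int(d)
        unit = UNITS[len(digits) - 1 - i]
        if n == 0:
            continue
        if n == 1 and unit:
            reading += unit              # 150 -> ヒャク..., not イチヒャク...
        else:
            reading += DIGITS[n] + unit
    return reading or 'レイ'

sentence = "概ね50両から150両の間で推移していた"
for match in re.finditer(r'\d+', sentence):
    print(match.group(), '->', read_number(match.group()))
# 50 -> ゴジュウ
# 150 -> ヒャクゴジュウ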
In the end, I built it almost without looking at the original source (oops). Occasionally there was noise or something cut off in a strange place, but on the whole it worked correctly. I think the final selection still has to be done by hand. I do wonder how the original tweets from its bot and fetches the source of each quote.
Plenty of them are detected, but at the moment my favorite is the following.
**The annoying act of copy-pasting the same reply content over and over** **(Naiyō no resu o nandomo kopipe shite kakikomu to iu meiwaku kōi)**
Well then.