result

```
"おばけ" can rhyme with "おまえ" [similarity: 0.24651645]
```
- ① Rhyming word search tool
  - APIs used
    - COTOHA Similarity Judgment API
- ② Search word pool generation tool
  - APIs used
    - COTOHA Parsing API
    - Qiita posted-article acquisition API
- A tool that extracts, from a CSV file, words that can rhyme with a specified word
  - Optionally judges the similarity between the words using the COTOHA API
- Converts the specified word and the words in the CSV file to romaji using pykakasi
converter.py

```python
from pykakasi import kakasi


def convert_hiragana_to_roma(self, target_word_hiragana):
    # Special-case the sokuon (small っ):
    # both っ and つ are converted to the same "tsu",
    # so use "x" as a special character for っ instead.
    if target_word_hiragana == "っ":
        return "x"
    else:
        kakasi_lib = kakasi()
        # Hiragana to romaji
        kakasi_lib.setMode('H', 'a')
        conv = kakasi_lib.getConverter()
        target_word_roma = conv.do(target_word_hiragana)
        return target_word_roma
```
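For example, feeding the method one hiragana character at a time behaves like this (a hypothetical driver, assuming the method lives on a `Converter` class):

```python
# Hypothetical usage sketch, assuming convert_hiragana_to_roma is a method of a Converter class
converter = Converter()
for char in ["お", "ば", "け", "っ"]:
    print(converter.convert_hiragana_to_roma(char))
# => o, ba, ke, x  (the sokuon っ becomes the special marker "x")
```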
- Extract the vowel pattern from the romaji and compare whether two words share the same pattern
| Condition | Original word | Before conversion | After conversion |
|---|---|---|---|
| Vowels only | おばけ | obake | oae |
| Contains a sokuon (っ) | いっぱい | ippai | ixai |
| Contains ん | さんま | sanma | ana |
| Contains ー | サンダー | sanda- | anaa |
converter.py

```python
# Convert a list of per-character romaji into a phoneme pattern
def convert_roma_to_phoneme_pattern(self, target_char_roma_list):
    pre_phoneme = None
    hit_list = []
    for target_char_roma in target_char_roma_list:
        # Vowel case: any of あ, い, う, え, お (a, i, u, e, o)
        vowel_char = self.__find_vowel_char(
            target_char_roma
        )
        specific_char = self.__find_specific_char(
            pre_phoneme,
            target_char_roma
        )
        if vowel_char:
            hit_list.append(vowel_char)
            pre_phoneme = vowel_char
        elif specific_char:
            # Not a vowel, but still a target character:
            # っ (mapped to "x"), ん, or ー
            hit_list.append(specific_char)
            pre_phoneme = specific_char
        else:
            continue
    phoneme_pattern = "".join(hit_list)
    return phoneme_pattern

def __find_vowel_char(self, char_roma):
    # Vowel case
    vowel_list = ["a", "i", "u", "e", "o"]
    for vowel in vowel_list:
        if char_roma.find(vowel) > -1:
            return vowel
        else:
            continue
    # Not a vowel
    return None

def __find_specific_char(self, pre_phoneme, char_roma):
    # ん ("n") or っ (mapped to "x")
    if char_roma == "n" or char_roma == "x":
        return char_roma
    # ー: treat it as a repeat of the previous vowel
    # Example: "da-" -> "a", "a"
    elif pre_phoneme is not None and char_roma == "-":
        return pre_phoneme
    else:
        return None
```
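Putting the two steps together, a small driver (hypothetical, assuming both methods sit on the same `Converter` class) reproduces the table above:

```python
# Hypothetical driver reproducing the conversion table above
converter = Converter()
words = [
    ["o", "ba", "ke"],        # おばけ
    ["i", "x", "pa", "i"],    # いっぱい (っ already mapped to "x")
    ["sa", "n", "ma"],        # さんま
    ["sa", "n", "da", "-"],   # サンダー
]
for char_list in words:
    print(converter.convert_roma_to_phoneme_pattern(char_list))
# => oae, ixai, ana, anaa
```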
execute

```console
$ cd src
$ python main.py おばけ
```

result

```
"おばけ" can rhyme with "こたえ"
"おばけ" can rhyme with "おまえ"
```
- After extracting the combinations of words that can rhyme, the specified word is passed as `base_word` and the word extracted from the CSV as `pool_word` for similarity analysis.
cotoha_client.py

```python
def check_score(self, base_word, pool_word, access_token):
    headers = {
        "Content-Type": COTOHA_CONTENT_TYPE,
        "charset": COTOHA_CHAR_SET,
        "Authorization": "Bearer {}".format(access_token)
    }
    data = {
        "s1": base_word,
        "s2": pool_word,
        "type": "default"
    }
    req = urllib.request.Request(
        f"{COTOHA_BASE_URL}/{COTOHA_SIMILARITY_API_NAME}",
        json.dumps(data).encode(),
        headers
    )
    # Wait between requests to stay within the API rate limit
    time.sleep(COTOHA_REQUEST_SLEEP_TIME)
    with urllib.request.urlopen(req) as res:
        body = res.read()
        return json.loads(body.decode())["result"]["score"]
```
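The similarity API responds with JSON of the form `{"result": {"score": ...}, ...}`, which is why the method indexes `["result"]["score"]`. A minimal call might look like this (hypothetical wrapper class, with an access token already fetched):

```python
# Hypothetical usage, assuming a CotohaClient class holding check_score
# and an access token already obtained from the COTOHA auth endpoint
client = CotohaClient()
score = client.check_score("おばけ", "おまえ", access_token)
print(score)  # e.g. 0.24651645, as in the result below
```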
execute

```console
$ cd src
$ python main.py おばけ
```

result

```
"おばけ" can rhyme with "こたえ" [similarity: 0.063530244]
"おばけ" can rhyme with "おまえ" [similarity: 0.24651645]
```
- Originally, during development, I used the noun list bundled with MeCab as the word pool.
- I thought it would be more interesting to have a mechanism for growing the number and variety of words, so I came up with a word pool generation tool.
- A tool that generates the CSV of search words used by the rhyming word search tool in ①
- Fetch the titles of posted articles using Qiita's posted-article acquisition API
qiita_client.py

```python
def list_articles(self):
    req = urllib.request.Request(
        f"{QIITA_BASE_URL}/{QIITA_API_NAME}?page={QIITA_PAGE_NUMBERS}&per_page={QIITA_ITEMS_PAR_PAGE}"
    )
    with urllib.request.urlopen(req) as res:
        body = res.read()
        return json.loads(body.decode())
```
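The Qiita endpoint returns a JSON array of article objects, so pulling out the titles is a one-liner (a sketch, assuming a `QiitaClient` class wraps the method above):

```python
# Hypothetical usage: collect the "title" field of each returned article
client = QiitaClient()
titles = [article["title"] for article in client.list_articles()]
```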
- Tag the fetched titles with parts of speech using COTOHA's parsing API
cotoha_client.py

```python
# Pass the Qiita article title as target_sentence
def parse(self, target_sentence, access_token):
    headers = {
        "Content-Type": COTOHA_CONTENT_TYPE,
        "charset": COTOHA_CHAR_SET,
        "Authorization": "Bearer {}".format(access_token)
    }
    data = {
        "sentence": target_sentence,
    }
    req = urllib.request.Request(
        f"{COTOHA_BASE_URL}/{COTOHA_PARSE_API_NAME}",
        json.dumps(data).encode(),
        headers
    )
    # Wait between requests to stay within the API rate limit
    time.sleep(COTOHA_REQUEST_SLEEP_TIME)
    with urllib.request.urlopen(req) as res:
        body = res.read()
        return json.loads(body.decode())["result"]
```
- Extract only the nouns and write them to a CSV file
finder.py

```python
# Extract only the nouns from the parse result and return them as a list
def find_noun(self, target_sentence_element):
    noun_list = []
    for element in target_sentence_element:
        for token in element["tokens"]:
            target_form = token["form"]
            target_kana = token["kana"]
            target_pos = token["pos"]
            # If the token is a noun, store it in the list
            if target_pos == TARGET_CLASS:
                # For English words, numbers, and symbols,
                # store the reading kana instead of the surface form
                # TODO: this judgment still has room for improvement
                if re.match(FINDER_REGEX, target_form):
                    noun_list.append(target_kana)
                else:
                    noun_list.append(target_form)
    return noun_list
```
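To make the expected input shape concrete, here is a toy parse result in the structure the code consumes (illustrative values, not real API output; assumes `TARGET_CLASS` is the noun tag `名詞` and `FINDER_REGEX` matches ASCII words):

```python
# Toy parse result in the shape find_noun expects (illustrative, not real API output)
parse_result = [
    {"tokens": [
        {"form": "Python", "kana": "パイソン", "pos": "名詞"},
        {"form": "入門", "kana": "ニュウモン", "pos": "名詞"},
        {"form": "する", "kana": "スル", "pos": "動詞語幹"},
    ]}
]
finder = Finder()
print(finder.find_noun(parse_result))
# => ['パイソン', '入門']  (the English form falls back to its reading kana)
```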
execute

```console
$ cd tool
$ python word_pool_generator.py
```

word_pool.csv

```
backup
tool
ABC
string
visual
studio
code
Note
management
Expansion
Summary
paper
Commentary
```
- Honestly, it is slow: even 40 posted articles take about 5 minutes to process.
- The number of nouns that can be extracted from one article title is roughly 2 to 5.
- It was my first time using pandas when writing the CSV file, so I think the logic can still be improved (see the sketch after this list).
- For now it is at the level of "something that works".
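As a reference for the CSV step mentioned above, a minimal pandas sketch (my assumption about the layout: one noun per row, no header or index):

```python
import pandas as pd

# Minimal sketch of the CSV output step (assumed layout: one noun per row)
noun_list = ["バックアップ", "ツール", "文字列"]
pd.DataFrame(noun_list).to_csv("word_pool.csv", header=False, index=False, encoding="utf-8")
```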
- Improve the judgment of English words
  - With the current logic, "Raspberry Pi" does not get the reading kana I intended.
  - For example, if I could pass just "Raspberry" to the parsing API and have it read correctly, the results could be improved a little by devising how words are passed.
  - Incidentally, "Google" was read correctly.
- Increase the variety of words
  - It seems words from other fields could be collected by scraping other sites.
- The reason I built this in the first place: about half a year ago I read this article and had a conversation with a friend along the lines of "Could natural language processing be used to find rhyming words?"
- However, I knew nothing about the field of natural language processing at the time (and still barely do), and when I happened to come across this project I thought I could build something close to that idea, so I decided to create this tool.