**A note in advance:** This is something I threw together over summer vacation, between part-time shifts, while rubbing my sleepy eyes before bed. It works, but there are probably inefficient parts, insecure parts, unused variables, and so on. If you spot an improvement, I'd be happy if you could tell me about it in gentle words.
This summer vacation, a SlackBot development boom suddenly broke out among the first-year students of our circle. Riding that wave, I made a bot of my own, and one of its functions is finding passages on Wikipedia that scan as 5-7-5 (the haiku/senryu meter). I'll write down what I learned while implementing it, as a memorandum.
A senior in the circle had already made a SlackBot that introduces a random Wikipedia page, so I investigated how that could be done. It turns out that accessing the following URL redirects you to a random Wikipedia article:
http://ja.wikipedia.org/wiki/Special:Randompage
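You can see this in action with a couple of lines (this snippet is just my illustration, not part of the bot): `requests` follows the redirect automatically, so the article you landed on shows up in `response.url`.

```python
import requests

# Special:Randompage redirects to a random article;
# requests follows the redirect, so r.url is the article we landed on.
r = requests.get("http://ja.wikipedia.org/wiki/Special:Randompage")
print(r.url)  # e.g. https://ja.wikipedia.org/wiki/... (some random article)
```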
I had written scrapers before in C#, but this was my first time writing one in Python.
https://qiita.com/poorko/items/9140c75415d748633a10
Referring to this site, I wrote:

```python
import requests
import pandas as pd  # (unused here; left over from the cited article)
from bs4 import BeautifulSoup

html = requests.get("http://ja.wikipedia.org/wiki/Special:Randompage").text
soup = BeautifulSoup(html, "html.parser")
# Remove <script> and <style> elements so they don't pollute the extracted text
for script in soup(["script", "style"]):
    script.decompose()
```

(In the cited source, the extracted text is split into a list with one element per line. Since 5-7-5 detection doesn't need to span multiple sentences, I used that list as-is.)
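For reference, the continuation that turns the soup into that line-by-line list looks like this (a small sketch; the same two lines appear in the full program further down):

```python
# Extract the visible text and split it into lines
text = soup.get_text()
files = text.split("\n")  # one element per line of the page
```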
Detecting a 5-7-5 means, in other words, detecting that "if you look at the reading of the text and walk through it word by word, it divides into 5-7-5." So you need both the reading of the text and the word boundaries. This is exactly where **morphological analysis** comes in handy. (Strictly speaking, morphemes and words are not the same thing, but that gets complicated, so I won't think too deeply about it.)
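To get a feel for it, here is what Janome's tokens look like (a quick demo of my own, assuming `janome` is installed; it isn't part of the bot itself). Each token carries a surface form, a part of speech, and a katakana reading:

```python
from janome.tokenizer import Tokenizer

t = Tokenizer()
for token in t.tokenize("東京で夏休み"):
    # surface form, first part-of-speech field, katakana reading
    print(token.surface, token.part_of_speech.split(",")[0], token.reading)
# 東京 名詞 トウキョウ
# で 助詞 デ
# 夏休み 名詞 ナツヤスミ
```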
First, let's count the number of characters in a reading:

```python
def howmuch(moziyomi):
    # Count the length (in sounds) of a full-width katakana reading
    i = 0
    for chara in moziyomi:
        if chara == 'ー':  # the long-vowel bar counts as one character
            i = i + 1
        for kana in [chr(c) for c in range(12449, 12532 + 1)]:  # katakana ァ..ヴ
            if chara == kana:
                i = i + 1
        # Small katakana other than ッ shouldn't count, so cancel them out
        if chara in ('ャ', 'ュ', 'ョ', 'ァ', 'ィ', 'ゥ', 'ェ', 'ォ'):
            i = i - 1
    return i
```
When morphological analysis is performed with Janome, the reading comes back in full-width katakana, so I count the characters of that katakana string. The long-vowel bar "ー" is counted as one character, and small katakana other than "ッ" are ignored.
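A quick sanity check (my own examples, not from the original code):

```python
print(howmuch('トーキョー'))  # 4: ト + ー + キ + ー (small ョ is ignored)
print(howmuch('キャベツ'))    # 3: キ + ベ + ツ (small ャ is ignored)
print(howmuch('ガッコウ'))    # 4: the small ッ *does* count as one sound
```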
Next comes the 5-7-5 judgment part:
```python
fin = False
flag = False
for file in files:
    s = file
    # Skip lines until the first "編集" ("edit") link appears, to avoid
    # boilerplate common to every page (e.g. "メインページ", the main page)
    if s.find('編集') > 0:
        flag = True
    if flag:
        words = [token.surface for token in t.tokenize(s)]
        hinsi = [token.part_of_speech.split(',')[0] for token in t.tokenize(s)]
        yomi = [token.reading for token in t.tokenize(s)]
        for i in range(len(words)):
            if fin:
                break
            uta = ""      # the 5-7-5 surface string
            utayomi = ""  # its katakana reading
            kami = ""     # upper 5
            naka = ""     # middle 7
            simo = ""     # lower 5
            keyword = ""
            if hinsi[i] == "名詞":  # noun (could also allow "動詞", verb)
                keyword = words[i]
                num = 0
                utastat = 0  # 0: in the upper 5, 1: middle 7, 2: lower 5
                count = i
                # Stop at 18 sounds, end of line, or an unreadable token ("*")
                while num < 18 and count < len(yomi) and yomi[count].find("*") < 0:
                    num = num + howmuch(yomi[count])
                    uta = uta + words[count]
                    utayomi = utayomi + yomi[count]
                    if utastat == 0:
                        kami = kami + words[count]
                        if num > 5:
                            break  # overshot the upper 5; try the next noun
                        elif num == 5:
                            utastat = 1
                    elif utastat == 1:
                        naka = naka + words[count]
                        if num > 12:
                            break  # overshot 5+7; try the next noun
                        elif num == 12:
                            utastat = 2
                    else:
                        simo = simo + words[count]
                        if num == 17:  # exactly 5+7+5 sounds
                            if utayomi.find("。") >= 0:
                                break  # straddles a sentence boundary; start over
                            elif (utayomi.find("(") >= 0 and utayomi.find(")") >= 0) or (
                                    utayomi.find("「") >= 0 and utayomi.find("」") >= 0) or (
                                    utayomi.find("<") >= 0 and utayomi.find(">") >= 0) or (
                                    utayomi.find("『") >= 0 and utayomi.find("』") >= 0):
                                fin = True  # brackets appear in pairs; accept
                                break
                            elif utayomi.find("(") >= 0 or utayomi.find(")") >= 0 or \
                                    utayomi.find("「") >= 0 or utayomi.find("」") >= 0 or \
                                    utayomi.find("<") >= 0 or utayomi.find(">") >= 0 or \
                                    utayomi.find("『") >= 0 or utayomi.find("』") >= 0:
                                break  # an unmatched bracket; start over
                            elif uta != "" and uta.find("リンク元") < 0:
                                fin = True  # found one (and it isn't sidebar text)
                                break
                    count = count + 1
```
What this code is doing:

- Check line by line, and ignore lines until the word "編集" (edit) appears. (Otherwise the result may contain strings common to every page, such as "メインページ", the main page.)
- When a noun (or verb) appears, start counting the string from there. (I figured a 5-7-5 would feel like a natural senryu if it starts with a noun or verb.)
- Check whether the reading of the string contains the symbol "*"; if it does, find the next noun or verb and start over. (Janome returns "*" as the reading of things it can't read, such as numbers; see the snippet after this list.)
- If, walking word by word, the string doesn't split into exactly 5-7-5, look for the next noun or verb and count again from there.
- If the string straddles a "。", find the next noun or verb and start over. (Straddling a sentence boundary inside a 5-7-5 sounds unnatural.)
- If there is an opening bracket, check that the closing bracket is also inside the 5-7-5. (This check is not sufficient, though: something like 」…「, a closing bracket followed by an opening one, would also pass.)
- If the string "リンク元" (link source) is inside the 5-7-5, start over. (Otherwise it returns generic, non-page-specific senryu like "リンク元関連ページの更新", "update of pages related to the link source".)
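Here's what that "*" looks like in practice (a quick illustration of Janome's behavior, not code from the bot):

```python
from janome.tokenizer import Tokenizer

t = Tokenizer()
for token in t.tokenize("2021年"):
    print(token.surface, token.reading)
# 2021 *    <- unknown tokens such as numbers get "*" as their reading
# 年 ネン
```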
If no 5-7-5 is found, the whole operation is repeated (fetch another random page and do the same thing). Putting it all together:
```python
import requests
import bs4
from janome.tokenizer import Tokenizer


def howmuch(moziyomi):
    # Count the length of a katakana reading (same function as above)
    i = 0
    for chara in moziyomi:
        if chara == 'ー':
            i = i + 1
        for kana in [chr(c) for c in range(12449, 12532 + 1)]:
            if chara == kana:
                i = i + 1
        if chara in ('ャ', 'ュ', 'ョ', 'ァ', 'ィ', 'ゥ', 'ェ', 'ォ'):
            i = i - 1
    return i


t = Tokenizer()  # build the tokenizer once; it is expensive to create
hujubun = True   # "insufficient": keep looping until a 5-7-5 is found
while hujubun:
    html = requests.get("http://ja.wikipedia.org/wiki/Special:Randompage").text
    soup = bs4.BeautifulSoup(html, "html.parser")
    for script in soup(["script", "style"]):
        script.decompose()
    text = soup.get_text()
    files = text.split("\n")
    fin = False
    flag = False
    # Initialize so the final check works even if nothing was found
    uta = ""
    kami = ""
    naka = ""
    simo = ""
    for file in files:
        s = file
        if s.find('編集') > 0:
            flag = True
        if flag:
            words = [token.surface for token in t.tokenize(s)]
            hinsi = [token.part_of_speech.split(',')[0] for token in t.tokenize(s)]
            yomi = [token.reading for token in t.tokenize(s)]
            for i in range(len(words)):
                if fin:
                    break
                uta = ""
                utayomi = ""
                kami = ""
                naka = ""
                simo = ""
                keyword = ""
                if hinsi[i] == "名詞":  # noun
                    keyword = words[i]
                    num = 0
                    utastat = 0
                    count = i
                    while num < 18 and count < len(yomi) and yomi[count].find("*") < 0:
                        num = num + howmuch(yomi[count])
                        uta = uta + words[count]
                        utayomi = utayomi + yomi[count]
                        if utastat == 0:
                            kami = kami + words[count]
                            if num > 5:
                                break
                            elif num == 5:
                                utastat = 1
                        elif utastat == 1:
                            naka = naka + words[count]
                            if num > 12:
                                break
                            elif num == 12:
                                utastat = 2
                        else:
                            simo = simo + words[count]
                            if num == 17:
                                if utayomi.find("。") >= 0:
                                    break  # straddles a sentence; start over
                                elif (utayomi.find("(") >= 0 and utayomi.find(")") >= 0) or (
                                        utayomi.find("「") >= 0 and utayomi.find("」") >= 0) or (
                                        utayomi.find("<") >= 0 and utayomi.find(">") >= 0) or (
                                        utayomi.find("『") >= 0 and utayomi.find("』") >= 0):
                                    fin = True
                                    break
                                elif utayomi.find("(") >= 0 or utayomi.find(")") >= 0 or \
                                        utayomi.find("「") >= 0 or utayomi.find("」") >= 0 or \
                                        utayomi.find("<") >= 0 or utayomi.find(">") >= 0 or \
                                        utayomi.find("『") >= 0 or utayomi.find("』") >= 0:
                                    break  # unmatched bracket; start over
                                elif uta != "" and uta.find("リンク元") < 0:
                                    fin = True
                                    break
                        count = count + 1
    # "の下で利用" filters the CC license footer text common to every page
    if uta != "" and uta.find("リンク元") < 0 and uta.find("の下で利用") < 0:
        hujubun = False
        print(kami + "\n" + naka + "\n" + simo)
```
I think this will probably work. I haven't carefully reviewed the code itself, so there may still be unused variables and obviously inefficient parts, but I'm the kind of person who grows when praised, so please go easy on me when pointing things out...