**A note in advance:** This is something I threw together over summer vacation, between part-time shifts, while rubbing my sleepy eyes before bed. It works, but there are probably inefficient parts, insecure parts, unused variables, and so on. If you spot an improvement, I'd be happy if you could tell me about it in gentle words.
This summer vacation, a SlackBot development boom suddenly broke out among the first-year students of our circle. Riding that wave, I made a bot of my own, and one of its functions is finding passages on Wikipedia that scan as 5-7-5 (the haiku/senryu meter). I'll write down what I learned while implementing it, as a memorandum.
A senior in the circle had already made a SlackBot that introduces a random Wikipedia page, so I investigated how that could be done. It turns out that accessing the following URL redirects you to a random Wikipedia article:
http://ja.wikipedia.org/wiki/Special:Randompage
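You can see this in action with a couple of lines (this snippet is just my illustration, not part of the bot): `requests` follows the redirect automatically, so the article you landed on shows up in `response.url`.

```python
import requests

# Special:Randompage redirects to a random article;
# requests follows the redirect, so r.url is the article we landed on.
r = requests.get("http://ja.wikipedia.org/wiki/Special:Randompage")
print(r.url)  # e.g. https://ja.wikipedia.org/wiki/... (some random article)
```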
I had written scrapers before in C#, but this was my first time writing one in Python.
https://qiita.com/poorko/items/9140c75415d748633a10
Referring to this site, I wrote:

```python
import requests
import pandas as pd  # (unused here; left over from the cited article)
from bs4 import BeautifulSoup

html = requests.get("http://ja.wikipedia.org/wiki/Special:Randompage").text
soup = BeautifulSoup(html, "html.parser")
# Remove <script> and <style> elements so they don't pollute the extracted text
for script in soup(["script", "style"]):
    script.decompose()
```

(In the cited source, the extracted text is split into a list with one element per line. Since 5-7-5 detection doesn't need to span multiple sentences, I used that list as-is.)
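For reference, the continuation that turns the soup into that line-by-line list looks like this (a small sketch; the same two lines appear in the full program further down):

```python
# Extract the visible text and split it into lines
text = soup.get_text()
files = text.split("\n")  # one element per line of the page
```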
Detecting a 5-7-5 means, in other words, detecting that "if you look at the reading of the text and walk through it word by word, it divides into 5-7-5." So you need both the reading of the text and the word boundaries. This is exactly where **morphological analysis** comes in handy. (Strictly speaking, morphemes and words are not the same thing, but that gets complicated, so I won't think too deeply about it.)
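To get a feel for it, here is what Janome's tokens look like (a quick demo of my own, assuming `janome` is installed; it isn't part of the bot itself). Each token carries a surface form, a part of speech, and a katakana reading:

```python
from janome.tokenizer import Tokenizer

t = Tokenizer()
for token in t.tokenize("東京で夏休み"):
    # surface form, first part-of-speech field, katakana reading
    print(token.surface, token.part_of_speech.split(",")[0], token.reading)
# 東京 名詞 トウキョウ
# で 助詞 デ
# 夏休み 名詞 ナツヤスミ
```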
First, let's count the number of characters in a reading:

```python
def howmuch(moziyomi):
    # Count the length (in sounds) of a full-width katakana reading
    i = 0
    for chara in moziyomi:
        if chara == 'ー':  # the long-vowel bar counts as one character
            i = i + 1
        for kana in [chr(c) for c in range(12449, 12532 + 1)]:  # katakana ァ..ヴ
            if chara == kana:
                i = i + 1
        # Small katakana other than ッ shouldn't count, so cancel them out
        if chara in ('ャ', 'ュ', 'ョ', 'ァ', 'ィ', 'ゥ', 'ェ', 'ォ'):
            i = i - 1
    return i
```
When morphological analysis is performed with Janome, the reading comes back in full-width katakana, so I count the characters of that katakana string. The long-vowel bar "ー" is counted as one character, and small katakana other than "ッ" are ignored.
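A quick sanity check (my own examples, not from the original code):

```python
print(howmuch('トーキョー'))  # 4: ト + ー + キ + ー (small ョ is ignored)
print(howmuch('キャベツ'))    # 3: キ + ベ + ツ (small ャ is ignored)
print(howmuch('ガッコウ'))    # 4: the small ッ *does* count as one sound
```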
Next comes the 5-7-5 judgment part:
```python
fin = False
flag = False
for file in files:
    s = file
    # Skip lines until the first "編集" ("edit") link appears, to avoid
    # boilerplate common to every page (e.g. "メインページ", the main page)
    if s.find('編集') > 0:
        flag = True
    if flag:
        words = [token.surface for token in t.tokenize(s)]
        hinsi = [token.part_of_speech.split(',')[0] for token in t.tokenize(s)]
        yomi = [token.reading for token in t.tokenize(s)]
        for i in range(len(words)):
            if fin:
                break
            uta = ""      # the 5-7-5 surface string
            utayomi = ""  # its katakana reading
            kami = ""     # upper 5
            naka = ""     # middle 7
            simo = ""     # lower 5
            keyword = ""
            if hinsi[i] == "名詞":  # noun (could also allow "動詞", verb)
                keyword = words[i]
                num = 0
                utastat = 0  # 0: in the upper 5, 1: middle 7, 2: lower 5
                count = i
                # Stop at 18 sounds, end of line, or an unreadable token ("*")
                while num < 18 and count < len(yomi) and yomi[count].find("*") < 0:
                    num = num + howmuch(yomi[count])
                    uta = uta + words[count]
                    utayomi = utayomi + yomi[count]
                    if utastat == 0:
                        kami = kami + words[count]
                        if num > 5:
                            break  # overshot the upper 5; try the next noun
                        elif num == 5:
                            utastat = 1
                    elif utastat == 1:
                        naka = naka + words[count]
                        if num > 12:
                            break  # overshot 5+7; try the next noun
                        elif num == 12:
                            utastat = 2
                    else:
                        simo = simo + words[count]
                        if num == 17:  # exactly 5+7+5 sounds
                            if utayomi.find("。") >= 0:
                                break  # straddles a sentence boundary; start over
                            elif (utayomi.find("(") >= 0 and utayomi.find(")") >= 0) or (
                                    utayomi.find("「") >= 0 and utayomi.find("」") >= 0) or (
                                    utayomi.find("<") >= 0 and utayomi.find(">") >= 0) or (
                                    utayomi.find("『") >= 0 and utayomi.find("』") >= 0):
                                fin = True  # brackets appear in pairs; accept
                                break
                            elif utayomi.find("(") >= 0 or utayomi.find(")") >= 0 or \
                                    utayomi.find("「") >= 0 or utayomi.find("」") >= 0 or \
                                    utayomi.find("<") >= 0 or utayomi.find(">") >= 0 or \
                                    utayomi.find("『") >= 0 or utayomi.find("』") >= 0:
                                break  # an unmatched bracket; start over
                            elif uta != "" and uta.find("リンク元") < 0:
                                fin = True  # found one (and it isn't sidebar text)
                                break
                    count = count + 1
```
What this code is doing:

- Check line by line, and ignore lines until the word "編集" (edit) appears. (Otherwise the result may contain strings common to every page, such as "メインページ", the main page.)
- When a noun (or verb) appears, start counting the string from there. (I figured a 5-7-5 would feel like a natural senryu if it starts with a noun or verb.)
- Check whether the reading of the string contains the symbol "*"; if it does, find the next noun or verb and start over. (Janome returns "*" as the reading of things it can't read, such as numbers; see the snippet after this list.)
- If, walking word by word, the string doesn't split into exactly 5-7-5, look for the next noun or verb and count again from there.
- If the string straddles a "。", find the next noun or verb and start over. (Straddling a sentence boundary inside a 5-7-5 sounds unnatural.)
- If there is an opening bracket, check that the closing bracket is also inside the 5-7-5. (This check is not sufficient, though: something like 」…「, a closing bracket followed by an opening one, would also pass.)
- If the string "リンク元" (link source) is inside the 5-7-5, start over. (Otherwise it returns generic, non-page-specific senryu like "リンク元関連ページの更新", "update of pages related to the link source".)
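Here's what that "*" looks like in practice (a quick illustration of Janome's behavior, not code from the bot):

```python
from janome.tokenizer import Tokenizer

t = Tokenizer()
for token in t.tokenize("2021年"):
    print(token.surface, token.reading)
# 2021 *    <- unknown tokens such as numbers get "*" as their reading
# 年 ネン
```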
If no 5-7-5 is found, the whole operation is repeated (fetch another random page and do the same thing). Putting it all together:
```python
import requests
import bs4
from janome.tokenizer import Tokenizer


def howmuch(moziyomi):
    # Count the length of a katakana reading (same function as above)
    i = 0
    for chara in moziyomi:
        if chara == 'ー':
            i = i + 1
        for kana in [chr(c) for c in range(12449, 12532 + 1)]:
            if chara == kana:
                i = i + 1
        if chara in ('ャ', 'ュ', 'ョ', 'ァ', 'ィ', 'ゥ', 'ェ', 'ォ'):
            i = i - 1
    return i


t = Tokenizer()  # build the tokenizer once; it is expensive to create
hujubun = True   # "insufficient": keep looping until a 5-7-5 is found
while hujubun:
    html = requests.get("http://ja.wikipedia.org/wiki/Special:Randompage").text
    soup = bs4.BeautifulSoup(html, "html.parser")
    for script in soup(["script", "style"]):
        script.decompose()
    text = soup.get_text()
    files = text.split("\n")
    fin = False
    flag = False
    # Initialize so the final check works even if nothing was found
    uta = ""
    kami = ""
    naka = ""
    simo = ""
    for file in files:
        s = file
        if s.find('編集') > 0:
            flag = True
        if flag:
            words = [token.surface for token in t.tokenize(s)]
            hinsi = [token.part_of_speech.split(',')[0] for token in t.tokenize(s)]
            yomi = [token.reading for token in t.tokenize(s)]
            for i in range(len(words)):
                if fin:
                    break
                uta = ""
                utayomi = ""
                kami = ""
                naka = ""
                simo = ""
                keyword = ""
                if hinsi[i] == "名詞":  # noun
                    keyword = words[i]
                    num = 0
                    utastat = 0
                    count = i
                    while num < 18 and count < len(yomi) and yomi[count].find("*") < 0:
                        num = num + howmuch(yomi[count])
                        uta = uta + words[count]
                        utayomi = utayomi + yomi[count]
                        if utastat == 0:
                            kami = kami + words[count]
                            if num > 5:
                                break
                            elif num == 5:
                                utastat = 1
                        elif utastat == 1:
                            naka = naka + words[count]
                            if num > 12:
                                break
                            elif num == 12:
                                utastat = 2
                        else:
                            simo = simo + words[count]
                            if num == 17:
                                if utayomi.find("。") >= 0:
                                    break  # straddles a sentence; start over
                                elif (utayomi.find("(") >= 0 and utayomi.find(")") >= 0) or (
                                        utayomi.find("「") >= 0 and utayomi.find("」") >= 0) or (
                                        utayomi.find("<") >= 0 and utayomi.find(">") >= 0) or (
                                        utayomi.find("『") >= 0 and utayomi.find("』") >= 0):
                                    fin = True
                                    break
                                elif utayomi.find("(") >= 0 or utayomi.find(")") >= 0 or \
                                        utayomi.find("「") >= 0 or utayomi.find("」") >= 0 or \
                                        utayomi.find("<") >= 0 or utayomi.find(">") >= 0 or \
                                        utayomi.find("『") >= 0 or utayomi.find("』") >= 0:
                                    break  # unmatched bracket; start over
                                elif uta != "" and uta.find("リンク元") < 0:
                                    fin = True
                                    break
                        count = count + 1
    # "の下で利用" filters the CC license footer text common to every page
    if uta != "" and uta.find("リンク元") < 0 and uta.find("の下で利用") < 0:
        hujubun = False
        print(kami + "\n" + naka + "\n" + simo)
```
I think this will probably work. I haven't carefully reviewed the code itself, so there may still be unused variables and obviously inefficient parts, but I'm the kind of person who grows when praised, so please go easy on me when pointing things out...