Recently, under the influence of a certain video series,[^tokkun] I have been hooked on the expression "It was too XX, so I became XX." For example: this morning it was too cold when I left the house, so I became samgyeopsal. That kind of thing.
However, for the "became XX" part in the second half, a surprisingly good word often fails to come to mind.
"It was too dazzling, so I became marble chocolate."
I suspect we sometimes settle for a near-miss word choice like that. As a warning to myself, I wrote a program that uses the COTOHA API to score what I would respectfully like to call the "too much" syntax.
The "too much" syntax takes the form
"It was too (adjective stem), so I became (noun)"
and I would say that the higher the similarity between the adjective stem and the noun, the better the sentence. The noun, however, must not be a nonsense word: if the noun part is not a general word, I want to give the sentence 0 points.
For the similarity between the adjective stem and the noun, I will simply use the Levenshtein distance (implemented below) for now. As with the opening example, "it was too cold, so I became samgyeopsal," I do not want to deduct points just because the noun side runs long, so only as many characters of the noun as there are in the adjective stem are looked at.[^tukkomi]
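To make the scoring rule concrete, here is the samgyeopsal example from the introduction worked through as a small sketch. The kana readings サム and サムギョプサル are my guess at what COTOHA returns, and `levenshtein_distance` is the implementation shown later in this article; the scoring line is the same one used in orochimaru.py.

```python
from levenshtein import levenshtein_distance  # implementation shown later

s1 = "サム"            # kana of the adjective stem of 寒い ("cold")
s2 = "サムギョプサル"    # kana of "samgyeopsal"

# Only the first len(s1) characters of the noun are compared,
# so the extra length of the noun is never penalized.
dist = levenshtein_distance(s1, s2[:len(s1)])  # -> 0
print(100 * (len(s1) - dist) / len(s1))        # -> 100.0
```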
$ echo "I became a horse because I was too horse" | python orochimaru.py
100.0 points
I'm too horsey ... I'm a horse ...
$ echo "It was so dazzling that it became a marble chocolate" | python orochimaru.py
33.3 points
It's too mabushi, it's become marble chocolate ...
$ echo "It was so funny that it became funny" | python orochimaru.py
0 points
It's too funny, it's become funny ...
The COTOHA API is used to parse the input text and to determine whether the noun is a general noun. Create a Developers account from the [COTOHA API Portal](https://api.ce-cotoha.com/) and make a note of your Client ID and Client secret. A Python library for the API is not registered on PyPI, but it is published on GitHub, so install it as shown below. (This is of course unnecessary if you call the API directly.) Note that at the moment the library seems to support only parsing and similarity calculation.
I am using Python 3.6. If you use pyenv or the like, set that up as appropriate.
```
$ git clone https://github.com/obilixilido/cotoha-nlp.git
$ cd cotoha-nlp/
$ pip install -e .
```
With that, the COTOHA API's parser is ready to use.
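As a quick check that the setup works, here is a minimal sketch of calling the parser. The Parser arguments and the token attributes are the same ones used in orochimaru.py later in this article; YOUR_CLIENT_ID and YOUR_CLIENT_SECRET are placeholders for your own credentials.

```python
from cotoha_nlp.parse import Parser

parser = Parser("YOUR_CLIENT_ID",
                "YOUR_CLIENT_SECRET",
                "https://api.ce-cotoha.com/api/dev/nlp",
                "https://api.ce-cotoha.com/v1/oauth/accesstokens")

# Parse one (Japanese) sentence from stdin and print the morpheme
# information that orochimaru.py relies on: part of speech, surface form,
# kana reading, and the features list ("Undef" marks unknown words).
result = parser.parse(input())
for token in result.tokens:
    print(token.pos, token.form, token.kana, token.features)
```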
The algorithm is a straightforward implementation of the one on [Wikipedia](https://ja.wikipedia.org/wiki/%E3%83%AC%E3%83%BC%E3%83%99%E3%83%B3%E3%82%B7%E3%83%A5%E3%82%BF%E3%82%A4%E3%83%B3%E8%B7%9D%E9%9B%A2), so I will omit the details.
levenshtein.py
```python
def levenshtein_distance(s1, s2):
    l1 = len(s1)
    l2 = len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0 for j in range(l2 + 1)] for i in range(l1 + 1)]
    for i in range(l1 + 1):
        dp[i][0] = i
    for i in range(l2 + 1):
        dp[0][i] = i
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            cost = 0 if (s1[i - 1] == s2[j - 1]) else 1
            dp[i][j] = min([dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost])
    return dp[l1][l2]
```
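As a sanity check, here is the marble chocolate example from above worked through with this function. The kana readings マブシ and マーブルチョコ are my assumption of what COTOHA returns; the scoring line is the same one used in orochimaru.py below.

```python
from levenshtein import levenshtein_distance

s1 = "マブシ"          # kana of the adjective stem of 眩しい ("dazzling")
s2 = "マーブルチョコ"    # kana of "marble chocolate"

# Compare only the first len(s1) characters of the noun.
dist = levenshtein_distance(s1, s2[:len(s1)])            # "マブシ" vs "マーブ" -> 2
print(f"{100 * (len(s1) - dist) / len(s1):.1f} points")  # -> 33.3 points
```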
What follows is a fairly rough implementation, but if you feed it text that matches "It was too XX, so I became XX," it outputs the score and the text of the analysis result.
orochimaru.py
```python
from cotoha_nlp.parse import Parser

import levenshtein


def find_orochi_sentence(tokens):
    # Expected surface forms and COTOHA parts of speech for the pattern
    # "(adjective stem) すぎて (noun) になった". Positions 0 and 3 (the
    # adjective stem and the noun) are checked separately below.
    form_list = ["", "すぎ", "て", "", "に", "な", "っ", "た"]
    pos_list = ["", "形容詞接尾辞", "動詞接尾辞", "", "格助詞", "動詞語幹", "動詞活用語尾", "動詞接尾辞"]
    i = 0
    s1 = ""; s2 = ""
    is_unknown = False
    for token in tokens:
        if i > 7:
            return 1
        if i == 0:
            if token.pos != "形容詞語幹":
                return 1
            s1 = token.kana
        elif i == 3:
            if token.pos != "名詞":
                return 1
            s2 = token.kana
            if "Undef" in token.features:
                is_unknown = True
        else:
            if i == 4 and token.pos == "名詞":
                # The noun may span several tokens; keep appending them.
                s2 += token.kana
                if "Undef" in token.features:
                    is_unknown = True
                continue
            if not (token.pos == pos_list[i] and token.form == form_list[i]):
                return 1
        i += 1

    if is_unknown:
        print("0 points")
    else:
        dist = levenshtein.levenshtein_distance(s1, s2[:len(s1)])
        print(f"{100 * (len(s1) - dist) / len(s1):.1f} points")
    print(f"Too {s1} ... became {s2} ...")
    return 0


parser = Parser("YOUR_CLIENT_ID",
                "YOUR_CLIENT_SECRET",
                "https://api.ce-cotoha.com/api/dev/nlp",
                "https://api.ce-cotoha.com/v1/oauth/accesstokens")
s = parser.parse(input())
if find_orochi_sentence(s.tokens) == 1:
    print("This was too silly and did not become the syntax")
```
The COTOHA API's parse results include morpheme information[^morpheme] for each word, and if a word is an unknown word, "Undef" is added to its "features". That information is used to judge whether the noun part of the "too much" syntax is a general noun.
Also, if kanji were included in the similarity calculation, notational variation would be a problem, so the comparison is done on the katakana readings. This means that if the COTOHA API assigns a reading different from the one you intended, the sentence will not be judged correctly. (Example: "It was too spicy, so I became a face.")
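To illustrate, here is a small sketch of that "spicy/face" example. The readings are hypothetical: 辛い can be read both からい and つらい, so the pun only scores if COTOHA happens to return カラ rather than ツラ for the stem, which is outside our control.

```python
from levenshtein import levenshtein_distance

s2 = "カオ"                  # kana reading of 顔 ("face")
for s1 in ("カラ", "ツラ"):   # two possible readings of the stem of 辛い
    dist = levenshtein_distance(s1, s2[:len(s1)])
    print(s1, f"{100 * (len(s1) - dist) / len(s1):.1f} points")
# カラ 50.0 points
# ツラ 0.0 points
```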
Some masters of the syntax get around the problem of not finding a good word with lines like "it was too XX, so I became a cedar" (sugi, "too much," is a homophone of 杉, "cedar"), but that is cheating and I will not score it.
Now, whenever a "too much" sentence comes to mind, I can get an objective evaluation of it.
This time I tried the COTOHA API for morphological analysis, and I found it convenient: it is easy to use and covers quite a lot of words. It is also nice that in the "became XX" part, XX is still inferred to be a noun even when it is an unknown word. The free plan has a limit on the number of API requests (1,000 per day), but for playing around like this that is not a problem.
Everyone, please try out the "too much" syntax as well. Thank you very much.
[^tokkun]: Link to the YouTube channel.
[^tukkomi]: This is a point you could poke at; I think there is a more proper way to implement it.
[^morpheme]: Official reference. "Undef" is not mentioned there, so this may stop working at some point ...