[JAVA] Comments on the COTOHA Parsing API

There can be multiple dependency sources

Post 2, a follow-up to "Parsing the COTOHA API in Java"

While testing this, I noticed something about how the API demo displays the analysis result for "My daughter-in-law and daughter went on a trip."

image.png

I had assumed that the demo shows the complete result, but in reality the API returns the following for the token "Daughter":


{
      "id" : 2,
      "form" : "Daughter",
      "kana" : "Musume",
      "lemma" : "Daughter",
      "pos" : "noun",
      "dependency_labels" : [ {
        "token_id" : 0,
        "label" : "conj"
      }, {
        "token_id" : 3,
        "label" : "case"
      } ],
      "attributes" : { }
    }

In other words, the actual response also contains the link drawn as a red line in the figure below.

image.png

The JSON attribute is named "dependency_labels" (plural), so it is easy to see that a token can have more than one. Looking only at the demo, however, you would think there is never more than one, so some care is needed. I also felt that the demo does not fully convey the appeal of the API.
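As a rough sketch of what this means for client code, the token has to be modeled with a list field rather than a single value. The class names below are my own and are not part of the API, and only the fields discussed here are included.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal model of one entry in "dependency_labels" (hypothetical class). */
final class DependencyLabel {
    final int tokenId;
    final String label;

    DependencyLabel(int tokenId, String label) {
        this.tokenId = tokenId;
        this.label = label;
    }
}

/** Minimal model of a token (hypothetical class, abridged fields). */
final class ParsedToken {
    final int id;
    final String form;
    // A list, not a single value: one token can carry several entries.
    final List<DependencyLabel> dependencyLabels;

    ParsedToken(int id, String form, List<DependencyLabel> dependencyLabels) {
        this.id = id;
        this.form = form;
        this.dependencyLabels = dependencyLabels;
    }

    /** Formats every entry as "tokenId:label". */
    List<String> describeLabels() {
        List<String> out = new ArrayList<>();
        for (DependencyLabel d : dependencyLabels) {
            out.add(d.tokenId + ":" + d.label);
        }
        return out;
    }
}

final class DependencyLabelsDemo {
    /** Builds the "Daughter" token from the response above by hand. */
    static ParsedToken daughterToken() {
        return new ParsedToken(2, "Daughter", List.of(
                new DependencyLabel(0, "conj"),
                new DependencyLabel(3, "case")));
    }

    public static void main(String[] args) {
        System.out.println(daughterToken().describeLabels()); // prints [0:conj, 3:case]
    }
}
```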

Only one document can be sent

Only one "document" (for example, one of many call-center logs) can be specified as the processing target per API call. When processing a large number of documents, I would like to send several at once instead of one by one, so I think this point is also worth noting. (For example, it would be inappropriate to concatenate the call logs of different customers and parse them together.) I hope the next version will be able to process multiple documents.
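Until then, the client has to loop over the documents itself. The following is only a sketch of that loop: `parseCall` is a hypothetical stand-in for the real HTTP request, and the document IDs and log texts are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class PerDocumentParser {
    /**
     * Sends each document in its own call and keeps the responses apart,
     * so that logs of different customers are never concatenated.
     * parseCall is a hypothetical stand-in for the real API request.
     */
    static Map<String, String> parseEach(Map<String, String> documentsById,
                                         Function<String, String> parseCall) {
        Map<String, String> responses = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : documentsById.entrySet()) {
            responses.put(e.getKey(), parseCall.apply(e.getValue()));
        }
        return responses;
    }

    public static void main(String[] args) {
        Map<String, String> logs = new LinkedHashMap<>();
        logs.put("customer-A", "Call log of customer A.");
        logs.put("customer-B", "Call log of customer B.");
        // Stub that only echoes the input instead of calling the API.
        Map<String, String> out = parseEach(logs, text -> "parsed: " + text);
        System.out.println(out);
    }
}
```

The cost is one round trip per document, which is exactly why batch support in the API would be welcome.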

image.png

Behavior when parsing multiple sentences

When I send "My daughter-in-law and my daughter went on a trip. I and my son ate grilled meat.", the following response is returned.


{
  "result" : [ {
    "chunk_info" : {
      "id" : 0,
      "head" : 1,
      "dep" : "P",
      "chunk_head" : 0,
      "chunk_func" : 1,
      "links" : [ ]
    },
    "tokens" : [ {
      "id" : 0,
      "form" : "Daughter-in-law",
      "kana" : "Yome",
      "lemma" : "Daughter-in-law",
      "pos" : "noun",
      "features" : [ ],
      "common_noun_semantic" : [ 49, 76, 88 ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ ],
      "dependency_labels" : [ {
        "token_id" : 1,
        "label" : "cc"
      } ],
      "attributes" : { }
    }, {
      "id" : 1,
      "form" : "When",
      "kana" : "To",
      "lemma" : "When",
      "pos" : "Case particles",
      "features" : [ "Continuous use" ],
      "common_noun_semantic" : [ ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ ],
      "attributes" : { }
    } ]
  }, {
    "chunk_info" : {
      "id" : 1,
      "head" : 7,
      "dep" : "D",
      "chunk_head" : 0,
      "chunk_func" : 1,
      "links" : [ {
        "link" : 0,
        "label" : "other"
      } ]
    },
    "tokens" : [ {
      "id" : 2,
      "form" : "Daughter",
      "kana" : "Musume",
      "lemma" : "Daughter",
      "pos" : "noun",
      "features" : [ ],
      "common_noun_semantic" : [ 49, 59, 88 ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ ],
      "dependency_labels" : [ {
        "token_id" : 0,
        "label" : "conj"
      }, {
        "token_id" : 3,
        "label" : "case"
      } ],
      "attributes" : { }
    }, {
      "id" : 3,
      "form" : "Is",
      "kana" : "C",
      "lemma" : "Is",
      "pos" : "Conjunctive particles",
      "features" : [ ],
      "common_noun_semantic" : [ ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ ],
      "attributes" : { }
    } ]
  }, {
    "chunk_info" : {
      "id" : 2,
      "head" : 3,
      "dep" : "D",
      "chunk_head" : 0,
      "chunk_func" : 1,
      "links" : [ ]
    },
    "tokens" : [ {
      "id" : 4,
      "form" : "Travel",
      "kana" : "Ryoko",
      "lemma" : "Travel",
      "pos" : "noun",
      "features" : [ "motion" ],
      "common_noun_semantic" : [ 1658, 1659, 1660 ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ 18 ],
      "dependency_labels" : [ {
        "token_id" : 5,
        "label" : "case"
      } ],
      "attributes" : { }
    }, {
      "id" : 5,
      "form" : "To",
      "kana" : "D",
      "lemma" : "To",
      "pos" : "Case particles",
      "features" : [ "Continuous use" ],
      "common_noun_semantic" : [ ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ ],
      "attributes" : { }
    } ]
  }, {
    "chunk_info" : {
      "id" : 3,
      "head" : 7,
      "dep" : "P",
      "chunk_head" : 0,
      "chunk_func" : 2,
      "links" : [ {
        "link" : 2,
        "label" : "purpose"
      } ],
      "predicate" : [ "past" ]
    },
    "tokens" : [ {
      "id" : 6,
      "form" : "line",
      "kana" : "I",
      "lemma" : "go",
      "pos" : "Verb stem",
      "features" : [ "IKU" ],
      "common_noun_semantic" : [ 2053, 2132 ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ 15, 20, 29, 32, 5 ],
      "dependency_labels" : [ {
        "token_id" : 4,
        "label" : "nmod"
      }, {
        "token_id" : 7,
        "label" : "aux"
      }, {
        "token_id" : 8,
        "label" : "aux"
      }, {
        "token_id" : 9,
        "label" : "punct"
      } ],
      "attributes" : { }
    }, {
      "id" : 7,
      "form" : "Tsu",
      "kana" : "Tsu",
      "lemma" : "Tsu",
      "pos" : "Verb conjugation ending",
      "features" : [ ],
      "common_noun_semantic" : [ ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ ],
      "attributes" : { }
    }, {
      "id" : 8,
      "form" : "Ta",
      "kana" : "Ta",
      "lemma" : "Ta",
      "pos" : "Verb suffix",
      "features" : [ "stop" ],
      "common_noun_semantic" : [ ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ ],
      "attributes" : { }
    }, {
      "id" : 9,
      "form" : "。",
      "kana" : "",
      "lemma" : "。",
      "pos" : "Kuten",
      "features" : [ ],
      "common_noun_semantic" : [ ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ ],
      "attributes" : { }
    } ]
  }, {
    "chunk_info" : {
      "id" : 4,
      "head" : 7,
      "dep" : "D",
      "chunk_head" : 0,
      "chunk_func" : 1,
      "links" : [ ]
    },
    "tokens" : [ {
      "id" : 10,
      "form" : "I",
      "kana" : "I",
      "lemma" : "I",
      "pos" : "noun",
      "features" : [ "Pronoun" ],
      "common_noun_semantic" : [ 37, 8 ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ ],
      "dependency_labels" : [ {
        "token_id" : 11,
        "label" : "cc"
      } ],
      "attributes" : { }
    }, {
      "id" : 11,
      "form" : "When",
      "kana" : "To",
      "lemma" : "When",
      "pos" : "Case particles",
      "features" : [ "Continuous use" ],
      "common_noun_semantic" : [ ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ ],
      "attributes" : { }
    } ]
  }, {
    "chunk_info" : {
      "id" : 5,
      "head" : 7,
      "dep" : "D",
      "chunk_head" : 0,
      "chunk_func" : 1,
      "links" : [ ]
    },
    "tokens" : [ {
      "id" : 12,
      "form" : "son",
      "kana" : "Musco",
      "lemma" : "son",
      "pos" : "noun",
      "features" : [ ],
      "common_noun_semantic" : [ 48, 58, 87 ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ ],
      "dependency_labels" : [ {
        "token_id" : 13,
        "label" : "case"
      } ],
      "attributes" : { }
    }, {
      "id" : 13,
      "form" : "Is",
      "kana" : "C",
      "lemma" : "Is",
      "pos" : "Conjunctive particles",
      "features" : [ ],
      "common_noun_semantic" : [ ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ ],
      "attributes" : { }
    } ]
  }, {
    "chunk_info" : {
      "id" : 6,
      "head" : 7,
      "dep" : "D",
      "chunk_head" : 0,
      "chunk_func" : 1,
      "links" : [ ]
    },
    "tokens" : [ {
      "id" : 14,
      "form" : "Roasted meat",
      "kana" : "Yakiniku",
      "lemma" : "Roasted meat",
      "pos" : "noun",
      "features" : [ ],
      "common_noun_semantic" : [ 843, 852 ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ ],
      "dependency_labels" : [ {
        "token_id" : 15,
        "label" : "case"
      } ],
      "attributes" : { }
    }, {
      "id" : 15,
      "form" : "To",
      "kana" : "Wo",
      "lemma" : "To",
      "pos" : "Case particles",
      "features" : [ "Continuous use" ],
      "common_noun_semantic" : [ ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ ],
      "attributes" : { }
    } ]
  }, {
    "chunk_info" : {
      "id" : 7,
      "head" : -1,
      "dep" : "O",
      "chunk_head" : 0,
      "chunk_func" : 1,
      "links" : [ {
        "link" : 1,
        "label" : "agent"
      }, {
        "link" : 3,
        "label" : "manner"
      }, {
        "link" : 4,
        "label" : "coagent"
      }, {
        "link" : 5,
        "label" : "agent"
      }, {
        "link" : 6,
        "label" : "object"
      } ],
      "predicate" : [ "past" ]
    },
    "tokens" : [ {
      "id" : 16,
      "form" : "eat",
      "kana" : "Tabe",
      "lemma" : "eat",
      "pos" : "Verb stem",
      "features" : [ "A" ],
      "common_noun_semantic" : [ 1581, 1590 ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ 2, 23 ],
      "dependency_labels" : [ {
        "token_id" : 2,
        "label" : "nsubj"
      }, {
        "token_id" : 6,
        "label" : "advcl"
      }, {
        "token_id" : 10,
        "label" : "nmod"
      }, {
        "token_id" : 12,
        "label" : "nsubj"
      }, {
        "token_id" : 14,
        "label" : "dobj"
      }, {
        "token_id" : 17,
        "label" : "aux"
      }, {
        "token_id" : 18,
        "label" : "punct"
      } ],
      "attributes" : { }
    }, {
      "id" : 17,
      "form" : "Ta",
      "kana" : "Ta",
      "lemma" : "Ta",
      "pos" : "Verb suffix",
      "features" : [ "stop" ],
      "common_noun_semantic" : [ ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ ],
      "attributes" : { }
    }, {
      "id" : 18,
      "form" : "。",
      "kana" : "",
      "lemma" : "。",
      "pos" : "Kuten",
      "features" : [ ],
      "common_noun_semantic" : [ ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ ],
      "attributes" : { }
    } ]
  } ],
  "status" : 0,
  "message" : ""
}


Look at "go" (token 6) here: it has four dependency destinations, 4, 7, 8, and 9. It should have six, including 0 and 2, but the API returns only four.

The result I originally expected:

image.png

You get the expected result if you send a single sentence instead of sending multiple sentences at once.
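One possible workaround, then, is to split the text on the client side and send each sentence in its own call. Below is a deliberately naive sketch that splits after every Japanese full stop 。 (written as `\u3002` in the code). The class name and the sample text are my own placeholders, and a splitter this simple would also cut inside any name that contains that character.

```java
import java.util.ArrayList;
import java.util.List;

final class SentenceSplitter {
    /**
     * Splits text after each Japanese full stop (U+3002).
     * Deliberately naive: it does not know about names or quotations
     * that contain the character.
     */
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        // The lookbehind keeps the full stop attached to its sentence.
        for (String s : text.split("(?<=\u3002)")) {
            String trimmed = s.trim();
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Placeholder text standing in for the two example sentences.
        String text = "First sentence\u3002Second sentence\u3002";
        for (String sentence : split(text)) {
            // Each sentence would be sent in its own API call here.
            System.out.println(sentence);
        }
    }
}
```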

Dependency destinations of "go" when only "My daughter-in-law and daughter went on a trip." was sent:

{
      "id" : 6,
      "form" : "line",
      "kana" : "I",
      "lemma" : "go",
      "pos" : "Verb stem",
      "features" : [ "IKU" ],
      "common_noun_semantic" : [ 2053, 2132 ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ 15, 20, 29, 32, 5 ],
      "dependency_labels" : [ {
        "token_id" : 0,
        "label" : "nmod"
      }, {
        "token_id" : 2,
        "label" : "nsubj"
      }, {
        "token_id" : 4,
        "label" : "nmod"
      }, {
        "token_id" : 7,
        "label" : "aux"
      }, {
        "token_id" : 8,
        "label" : "aux"
      }, {
        "token_id" : 9,
        "label" : "punct"
      } ],
      "attributes" : { }
    }

Is the parsing itself unreliable when multiple sentences are sent, or is there a problem with the way I call the API? This needs further investigation.

Dependency destinations of "go" when another sentence was appended after "My daughter-in-law and daughter went on a trip." and both were sent together:

{
      "id" : 6,
      "form" : "line",
      "kana" : "I",
      "lemma" : "go",
      "pos" : "Verb stem",
      "features" : [ "IKU" ],
      "common_noun_semantic" : [ 2053, 2132 ],
      "proper_noun_semantic" : [ ],
      "declinable_word_semantic" : [ 15, 20, 29, 32, 5 ],
      "dependency_labels" : [ {
        "token_id" : 4,
        "label" : "nmod"
      }, {
        "token_id" : 7,
        "label" : "aux"
      }, {
        "token_id" : 8,
        "label" : "aux"
      }, {
        "token_id" : 9,
        "label" : "punct"
      } ],
      "attributes" : { }
    }
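To make the difference between the two excerpts above explicit, here is a small sketch that pulls the token_id values out of each "dependency_labels" array and reports which ones are missing. The regular expression is a shortcut of my own for this comparison only, not a JSON parser; real client code should use a proper JSON library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DependencyDiff {
    private static final Pattern TOKEN_ID =
            Pattern.compile("\"token_id\"\\s*:\\s*(\\d+)");

    /** Collects every token_id value found in a JSON fragment, in order. */
    static List<Integer> tokenIds(String jsonFragment) {
        List<Integer> ids = new ArrayList<>();
        Matcher m = TOKEN_ID.matcher(jsonFragment);
        while (m.find()) {
            ids.add(Integer.parseInt(m.group(1)));
        }
        return ids;
    }

    /** Returns the ids present in expected but absent from actual. */
    static List<Integer> missing(List<Integer> expected, List<Integer> actual) {
        List<Integer> result = new ArrayList<>(expected);
        result.removeAll(actual);
        return result;
    }

    public static void main(String[] args) {
        // dependency_labels of "go" when one sentence was sent (whitespace removed).
        String single = "[{\"token_id\":0,\"label\":\"nmod\"},{\"token_id\":2,\"label\":\"nsubj\"},"
                + "{\"token_id\":4,\"label\":\"nmod\"},{\"token_id\":7,\"label\":\"aux\"},"
                + "{\"token_id\":8,\"label\":\"aux\"},{\"token_id\":9,\"label\":\"punct\"}]";
        // The same array when a second sentence was appended.
        String concatenated = "[{\"token_id\":4,\"label\":\"nmod\"},{\"token_id\":7,\"label\":\"aux\"},"
                + "{\"token_id\":8,\"label\":\"aux\"},{\"token_id\":9,\"label\":\"punct\"}]";
        System.out.println(missing(tokenIds(single), tokenIds(concatenated))); // prints [0, 2]
    }
}
```

A check like this can be run against every response in a regression test, so that a change in the API's multi-sentence behavior is noticed quickly.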

The spec says "sentence: sentence to be parsed", which can be read as meaning a single sentence. However, splitting text into sentences is itself a natural language processing task, so if the API really targets only a single sentence, I think that is a problem with the API specification. In practical text mining, unlike in academia, it is rare to process only a single sentence.

image.png

Stanford NLP, for example, also returns sentence breaks as annotations. (A sentence such as "I went to see Morning Musume's live performance" has to be analyzed correctly, because the group's name in Japanese, モーニング娘。, itself contains a full stop.)

If I were a professional user, I would contact support. If I were in charge of delivery, I would press for a fix. If I were the development lead, I would have it fixed immediately ^_^;;

Link

COTOHA API Portal

That's all.
