Perform entity analysis using spaCy / GiNZA in Python

What is this about?

Let's use spaCy / GiNZA, a very convenient toolkit for natural language processing, to try "entity analysis" (named entity recognition), one of the real pleasures of text analysis. The GiNZA page is here: https://megagonlabs.github.io/ginza/

Entity analysis is a technique for finding meaningful chunks (entities) in a sentence. Given "Play FINAL FANTASY VII REMAKE on Sony PlayStation", for example, it finds "PlayStation = game console" and "FINAL FANTASY VII REMAKE = game title".

Maintaining a dictionary of game names is very difficult, because new games keep coming out without end. Instead, we find entities by inferring them from the surrounding context.
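To make the idea concrete, an entity can be thought of as a labeled character span inside the text. The following is a minimal sketch in plain Python, not how spaCy or GiNZA work internally. The `Entity` tuple and the `find_entity` helper are hypothetical names introduced only for illustration, and the lookup by known string is exactly the dictionary approach that a context-based model is meant to replace.

```python
from typing import NamedTuple, Optional


class Entity(NamedTuple):
    """A labeled character span: text[start:end] is the entity."""
    start: int
    end: int
    label: str


def find_entity(text: str, surface: str, label: str) -> Optional[Entity]:
    """Locate the first occurrence of `surface` in `text` and label it.

    This is only a string lookup for illustration; a real NER model
    predicts the spans from context instead of searching known strings.
    """
    start = text.find(surface)
    if start == -1:
        return None
    return Entity(start, start + len(surface), label)


text = "Play FINAL FANTASY VII REMAKE on Sony PlayStation"
entities = [
    find_entity(text, "FINAL FANTASY VII REMAKE", "game title"),
    find_entity(text, "PlayStation", "game console"),
]
for ent in entities:
    print(text[ent.start:ent.end], "=", ent.label)
```

The (start, end, label) shape is the same one used for the training annotations later in this article.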

First try using GiNZA

First, let's use GiNZA. Put simply, GiNZA is a library for Japanese analysis that comes with a pretrained model and everything else you need.

Anyway, it's easy enough to use.

First, install it with pip.

pip install -U ginza

To be honest, I stumbled in several places before pip install finally succeeded, but it may well work on the first try for you. If it doesn't, let me know in the comments.
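If you are not sure whether the install went through, a quick check like the one below can help. This is just a sketch using the standard library; `is_installed` is a hypothetical helper name, and it only tells you whether the packages can be found, not whether the GiNZA model itself loads correctly.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if the top-level `package` can be found for import."""
    return importlib.util.find_spec(package) is not None


# Check the two packages this article relies on
for name in ("spacy", "ginza"):
    status = "OK" if is_installed(name) else "not found"
    print(name, status)
```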

Sample code

Here is a first, simple example.

import spacy

# Load GiNZA's pretrained Japanese model
nlp = spacy.load('ja_ginza')

# The original input was Japanese text; it is shown here in English translation
doc = nlp('"FINAL FANTASY VII Remake" is a game software released by Square Enix. It is released first on PlayStation 4 and is an exclusive title until April 2021. Initially scheduled to be released worldwide on March 3, 2020, the release was postponed to April 10 of the same year.')

print("*** token ***")
for token in doc:
    print(token.i, token.orth_, token.lemma_, token.pos_, token.tag_, token.dep_, token.head.i)

print("*** entity ***")
for ent in doc.ents:
    print(ent.text, ent.label_)

The result looks like this. I will skip the detailed meaning of each column here, but you can see that morphological analysis has split the text into words and analyzed each one.

*** token ***
0 『 『 PUNCT auxiliary symbol-Open parentheses punct 4
1 Final Final NOUN noun-Appellative-General compound 4
2 fantasy fantasy NOUN noun-Appellative-General compound 4
3 VII vii NOUN noun-Appellative-General compound 4
4 Remake Remake NOUN noun-Appellative-Changeable ROOT 4
5 』 』 PUNCT auxiliary symbol-Parentheses closed punct 4
6 is the ADP particle-Particle case 4
7 , , PUNCT auxiliary symbol-Comma punct 4
8 Square Enix Square Enix PROPN noun-Proper noun-General compound 10
: (omitted)

*** entity ***
FINAL FANTASY VII Remake Book
Square Enix Person
PlayStation 4 Product_Other
April 2021 Date
March 3, 2020 Date
April 10, the same year Date

It feels pretty good.

Morphological analysis has split the text neatly into tokens, and Square Enix is recognized as a proper noun.

The entity I was after this time, "FINAL FANTASY VII Remake", is recognized as well. The label Book is a little odd, but the model is built from general-purpose data, so that can't be helped. Even a slightly tricky date expression such as "April 10 of the same year" is recognized as a Date. If you only want to extract common entities, this should be enough.
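Once the entities are extracted, a natural next step is to group them by label. The sketch below does this in plain Python so it can run without the model. `group_by_label` is a hypothetical helper, and the `pairs` list is simply copied from the entity output above; with a loaded model you would presumably build it from `doc.ents` instead.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_label(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect entity texts under their labels, keeping the original order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# The (text, label) pairs printed above; with a loaded model this could be
# built as [(ent.text, ent.label_) for ent in doc.ents].
pairs = [
    ("FINAL FANTASY VII Remake", "Book"),
    ("Square Enix", "Person"),
    ("PlayStation 4", "Product_Other"),
    ("April 2021", "Date"),
    ("March 3, 2020", "Date"),
    ("April 10, the same year", "Date"),
]

for label, texts in group_by_label(pairs).items():
    print(label, texts)
```

Grouping like this makes it easy to spot labels that do not fit your domain, such as the game title landing under Book.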

Create a custom dictionary

However, there are times when you want "FINAL FANTASY VII Remake" to be labeled Game_Title instead.

When you do natural language processing in real work, each business domain has its own technical terms. Here, for example, I want a game title to be treated as a game title, so let's do something about it.

Since I want to train a model myself this time, I will use spaCy's plain ja pipeline instead of GiNZA's ja_ginza. The code is almost the same as the spaCy sample code, and looks like this.

# Note: this script uses the spaCy v2 training API
# (nlp.create_pipe, nlp.begin_training, nlp.update with texts and annotations).
import random
from pathlib import Path

import spacy
from spacy.util import minibatch, compounding

# new entity label
LABEL = "Game_Title"

# Each example is (text, {"entities": [(start, end, label)]}).
# start/end are character offsets of the title in the text (end is exclusive).
TRAIN_DATA = [
    (
        '"FINAL FANTASY VII Remake" is a game software released by Square Enix.',
        {"entities": [(1, 25, LABEL)]}
    ),
    (
        'This is the official website of the remake work of "FINAL FANTASY VII Remake".',
        {"entities": [(52, 76, LABEL)]}
    ),
    (
        "FINAL FANTASY VII Remake-PS4 is always a bargain at the game store.",
        {"entities": [(0, 24, LABEL)]}
    )
]

random.seed(0)

# Start from a blank Japanese pipeline and add an NER component with the new label
nlp = spacy.blank("ja")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label(LABEL)
optimizer = nlp.begin_training()

# Train only the NER component; any other pipes are disabled during training
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes):
    for itn in range(30):  # 30 passes over the training data
        random.shuffle(TRAIN_DATA)
        losses = {}
        # Batch sizes follow a compounding schedule (start 1.0, max 4.0, rate 1.001)
        batches = minibatch(TRAIN_DATA, size=compounding(1.0, 4.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
        print("Losses", losses)

print()

test_text = 'Following "FINAL FANTASY VII Remake", "FINAL FANTASY II"!'
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
    print(ent.text, ent.label_) 

output_dir = Path(r"Appropriate folder name")  # replace with your own output folder
nlp.meta["name"] = "GameTitleModel"
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

The result looks like this. For now, "FINAL FANTASY VII Remake" is recognized as Game_Title.

Of course, this little training data is nowhere near enough for real use, but it is enough for a trial run.

Losses {'ner': 36.54436391592026}
Losses {'ner': 28.74292328953743}
Losses {'ner': 16.96098183095455}
   :
Entities in 'Following "FINAL FANTASY VII Remake", "FINAL FANTASY II"!'
FINAL FANTASY VII Remake Game_Title
FINAL FANTASY II Game_Title

It worked, for now. "FINAL FANTASY II", which was not in the training data, is also labeled Game_Title.
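One thing that is easy to get wrong in the training data is the character offsets, since they have to be counted by hand. The sketch below is one possible way to avoid that, in plain Python: `annotate` computes the offsets with `str.find`, and `check_offsets` reports examples whose span does not match the expected string. Both helper names are hypothetical and not part of spaCy. The check covers characters only; spaCy additionally expects the spans to line up with token boundaries.

```python
from typing import List, Tuple

Annotation = Tuple[str, dict]


def annotate(text: str, surface: str, label: str) -> Annotation:
    """Build one training example in the (text, {"entities": [...]}) shape.

    The offsets are computed with str.find, so they always match the text.
    Raises ValueError if `surface` does not occur in `text`.
    """
    start = text.find(surface)
    if start == -1:
        raise ValueError(f"{surface!r} not found in {text!r}")
    return (text, {"entities": [(start, start + len(surface), label)]})


def check_offsets(data: List[Annotation], surface: str) -> List[int]:
    """Return the indexes of examples whose first span is not `surface`."""
    bad = []
    for i, (text, ann) in enumerate(data):
        start, end, _ = ann["entities"][0]
        if text[start:end] != surface:
            bad.append(i)
    return bad


TITLE = "FINAL FANTASY VII Remake"
LABEL = "Game_Title"
examples = [
    annotate('"FINAL FANTASY VII Remake" is a game software released by Square Enix.', TITLE, LABEL),
    annotate("FINAL FANTASY VII Remake-PS4 is always a bargain at the game store.", TITLE, LABEL),
]
print(examples[0][1])
print(check_offsets(examples, TITLE))
```

Running a check like this before training catches spans that silently point at the wrong characters, which otherwise only shows up as poor results.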

Sorry, this is still a little rough, so I will keep updating it from time to time after publishing.