**Named entity recognition** is a technology that extracts proper nouns such as **personal names** and **place names**, and numerical expressions such as **dates** and **times**, from text. Named entity recognition is also used as an elemental technology in natural language processing applications such as **question answering systems**, **dialogue systems**, and **information extraction**.
This time, I will build a named entity extractor using **machine learning**.
※Note: No theory is covered here. If you want to know the theory, please look elsewhere.
This section gives an overview of named entity recognition and the method used to perform it.
Named entity extraction is a technology that extracts proper nouns such as personal names and place names, and numerical expressions such as dates and times, from text. Let's look at a concrete example and extract the named entities from the following sentence.
Taro went to see Hanako at 9 am on May 18th.
Extracting the named entities contained in the above sentence yields **Taro** and **Hanako** as **personal names**, **May 18** as a **date**, and **9 am** as a **time**.
In the above example, personal name, date, and time were extracted as named entity classes. In general, the following eight classes, from the Information Retrieval and Extraction Exercise (IREX) [named entity extraction task definition](https://nlp.cs.nyu.edu/irex/NE/), are often used.
Class | Example |
---|---|
ART (unique product name) | Nobel Prize in Literature, Windows 7 |
LOC (place name) | Chiba, USA |
ORG (organization) | Liberal Democratic Party, NHK |
PSN (personal name) | Shinzo Abe, Merkel |
DAT (date) | January 29, 2016/01/29 |
TIM (time) | 3 pm, 10:30 |
MNY (amount of money) | 241 yen, $8 |
PNT (percentage) | 10%, 30% |
One way to extract named entities is to label sentences that have been morphologically analyzed. The following is an example of labeling the sentence "Taro went to see Hanako at 9 am on May 18th." after morphological analysis.
The labels B-XXX and I-XXX indicate that the labeled strings are named entities. B-XXX marks the beginning of a named entity string, and I-XXX marks its continuation. The XXX part holds a named entity class such as ORG or PSN. Tokens that are not part of any named entity are labeled O.
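As a rough sketch of the scheme (using a hypothetical word-level tokenization of the example sentence; the actual corpus uses MeCab morphemes with part-of-speech columns, as shown later):

```python
# Hypothetical IOB2 labeling of the example sentence
labeled = [
    ('Taro', 'B-PSN'), ('went', 'O'), ('to', 'O'), ('see', 'O'),
    ('Hanako', 'B-PSN'), ('at', 'O'),
    ('9', 'B-TIM'), ('am', 'I-TIM'), ('on', 'O'),
    ('May', 'B-DAT'), ('18th', 'I-DAT'), ('.', 'O'),
]
```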
Labeling can be done with hand-written rules, but this time it will be done with **machine learning**. That is, we create a model from pre-labeled training data and use that model to label unlabeled sentences. Specifically, we train with an algorithm called CRF (conditional random fields).
Now let's get hands-on.
Start by installing the required Python modules. Execute the following commands in a terminal. We use CRFsuite as the CRF library, via the python-crfsuite bindings.
pip install numpy
pip install scipy
pip install scikit-learn
pip install python-crfsuite
Once installed, import the required modules. Execute the following code.
from itertools import chain
import pycrfsuite
import sklearn
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelBinarizer
Since CRF is a supervised learning method, we need labeled training data. This time, I prepared the tagged data in advance. Please download it from here. The file name is "hironsan.txt".
Now, let's first define a class to read the downloaded data.
import codecs

class CorpusReader(object):

    def __init__(self, path):
        with codecs.open(path, encoding='utf-8') as f:
            sent = []
            sents = []
            for line in f:
                # a blank line separates sentences
                if line == '\n':
                    sents.append(sent)
                    sent = []
                    continue
                # each line holds tab-separated morpheme information
                morph_info = line.strip().split('\t')
                sent.append(morph_info)
            # keep the last sentence if the file has no trailing blank line
            if sent:
                sents.append(sent)
        # hold out the last 10% of sentences as test data
        train_num = int(len(sents) * 0.9)
        self.__train_sents = sents[:train_num]
        self.__test_sents = sents[train_num:]

    def iob_sents(self, name):
        if name == 'train':
            return self.__train_sents
        elif name == 'test':
            return self.__test_sents
        else:
            return None
Next, load the downloaded data using the class we just created. There are 450 training sentences and 50 test sentences.
c = CorpusReader('hironsan.txt')
train_sents = c.iob_sents('train')
test_sents = c.iob_sents('test')
The loaded data has the following format. The IOB2 tags were attached after morphological analysis with the morphological analyzer MeCab. The data is divided into sentences, and each sentence is a list of morpheme information.
>>> train_sents[0]
[['2005', 'noun', 'number', '*', '*', '*', '*', '*', 'B-DAT'],
['Year', 'noun', 'suffix', 'Classifier', '*', '*', '*', 'Year', 'Nen', 'Nen', 'I-DAT'],
['7', 'noun', 'number', '*', '*', '*', '*', '*', 'I-DAT'],
['Month', 'noun', 'General', '*', '*', '*', '*', 'Month', 'Moon', 'Moon', 'I-DAT'],
['14', 'noun', 'number', '*', '*', '*', '*', '*', 'I-DAT'],
['Day', 'noun', 'suffix', 'Classifier', '*', '*', '*', 'Day', 'Nichi', 'Nichi', 'I-DAT'],
['、', 'symbol', 'Comma', '*', '*', '*', '*', '、', '、', '、', 'O'],
...
]
Next, let's go over the features used for named entity extraction and then code them. This time we use, within a window of two tokens before and after the current word: the words themselves, their part-of-speech subtypes, their character types, and the named entity tags of the preceding tokens.
The classification of character types is as follows. There are 7 types in all.
Character type tag | Description |
---|---|
ZSPACE | Blank |
ZDIGIT | Arabic numerals |
ZLLET | Lowercase letters |
ZULET | Uppercase letters |
HIRAG | Hiragana |
KATAK | Katakana |
OTHER | Other |
The character type used as a feature is the combination of all the character types contained in a word. For example, the Japanese word "多い" ("many") contains both kanji and hiragana. The hiragana character type tag is HIRAG, and kanji falls under OTHER, so the character type of "多い" is "HIRAG-OTHER".
The code for determining the character type is as follows. All character types contained in the string are joined with a hyphen (-).
def is_hiragana(ch):
    # Unicode hiragana block
    return 0x3040 <= ord(ch) <= 0x309F


def is_katakana(ch):
    # Unicode katakana block
    return 0x30A0 <= ord(ch) <= 0x30FF


def get_character_type(ch):
    if ch.isspace():
        return 'ZSPACE'
    elif ch.isdigit():
        return 'ZDIGIT'
    elif ch.islower():
        return 'ZLLET'
    elif ch.isupper():
        return 'ZULET'
    elif is_hiragana(ch):
        return 'HIRAG'
    elif is_katakana(ch):
        return 'KATAK'
    else:
        return 'OTHER'


def get_character_types(string):
    # deduplicate, sort, and join the character types of all characters
    character_types = map(get_character_type, string)
    character_types_str = '-'.join(sorted(set(character_types)))
    return character_types_str
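A quick sanity check on a few hypothetical inputs:

```python
print(get_character_types('2005'))     # ZDIGIT
print(get_character_types('Windows'))  # ZLLET-ZULET
print(get_character_types('多い'))      # HIRAG-OTHER
```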
The code to extract the part of speech subclassification from the morpheme information is as follows.
def extract_pos_with_subtype(morph):
    # join the part of speech and its subtypes, stopping at the first '*' placeholder
    idx = morph.index('*')
    return '-'.join(morph[1:idx])
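For example, applied to the morpheme row for "2005" shown earlier:

```python
morph = ['2005', 'noun', 'number', '*', '*', '*', '*', '*', 'B-DAT']
print(extract_pos_with_subtype(morph))  # noun-number
```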
Based on the above, the code to extract the features for each word is as follows. It is a bit verbose, but easy to follow.
def word2features(sent, i):
    word = sent[i][0]
    chtype = get_character_types(sent[i][0])
    postag = extract_pos_with_subtype(sent[i])
    # features of the current word
    features = [
        'bias',
        'word=' + word,
        'type=' + chtype,
        'postag=' + postag,
    ]
    # features of the word two positions back
    if i >= 2:
        word2 = sent[i-2][0]
        chtype2 = get_character_types(sent[i-2][0])
        postag2 = extract_pos_with_subtype(sent[i-2])
        iobtag2 = sent[i-2][-1]
        features.extend([
            '-2:word=' + word2,
            '-2:type=' + chtype2,
            '-2:postag=' + postag2,
            '-2:iobtag=' + iobtag2,
        ])
    else:
        features.append('BOS')
    # features of the previous word
    if i >= 1:
        word1 = sent[i-1][0]
        chtype1 = get_character_types(sent[i-1][0])
        postag1 = extract_pos_with_subtype(sent[i-1])
        iobtag1 = sent[i-1][-1]
        features.extend([
            '-1:word=' + word1,
            '-1:type=' + chtype1,
            '-1:postag=' + postag1,
            '-1:iobtag=' + iobtag1,
        ])
    else:
        features.append('BOS')
    # features of the next word
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        chtype1 = get_character_types(sent[i+1][0])
        postag1 = extract_pos_with_subtype(sent[i+1])
        features.extend([
            '+1:word=' + word1,
            '+1:type=' + chtype1,
            '+1:postag=' + postag1,
        ])
    else:
        features.append('EOS')
    # features of the word two positions ahead
    if i < len(sent)-2:
        word2 = sent[i+2][0]
        chtype2 = get_character_types(sent[i+2][0])
        postag2 = extract_pos_with_subtype(sent[i+2])
        features.extend([
            '+2:word=' + word2,
            '+2:type=' + chtype2,
            '+2:postag=' + postag2,
        ])
    else:
        features.append('EOS')
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    return [morph[-1] for morph in sent]


def sent2tokens(sent):
    return [morph[0] for morph in sent]
Extract features from sentences with sent2features. The features that are actually extracted are as follows.
>>> sent2features(train_sents[0])[0]
['bias',
'word=2005',
'type=ZDIGIT',
'postag=noun-number',
'BOS',
'BOS',
'+1:word=Year',
'+1:type=OTHER',
'+1:postag=noun-suffix-Classifier',
'+2:word=7',
'+2:type=ZDIGIT',
'+2:postag=noun-number']
We can see that features can be extracted from the data. Now extract the features and labels of the training and test sets for later use.
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]
X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
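As a quick check, the sizes should match the 450/50 split mentioned earlier:

```python
print(len(X_train), len(X_test))  # expect: 450 50
```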
To train the model, create a pycrfsuite.Trainer object, load the training data, and then call the train method. First, create a Trainer object and read the training data.
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)
Next, set the training parameters. Ideally, these would be tuned on development data, but this time they are fixed.
trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier
    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})
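If you are curious which parameters the current training algorithm (L-BFGS by default) accepts, python-crfsuite can list their names:

```python
print(trainer.params())  # names of all tunable parameters
```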
Now that we're ready, let's train the model. Specify the file name and execute the train method.
trainer.train('model.crfsuite')
When execution finishes, a file with the specified name is created; the trained model is stored in it.
To use the trained model, create a pycrfsuite.Tagger object, load the trained model, and use the tag method. First, create a Tagger object and load the trained model.
tagger = pycrfsuite.Tagger()
tagger.open('model.crfsuite')
Now, let's actually tag the sentence.
example_sent = test_sents[0]
print(' '.join(sent2tokens(example_sent)))
print("Predicted:", ' '.join(tagger.tag(sent2features(example_sent))))
print("Correct: ", ' '.join(sent2labels(example_sent)))
You should get a result like the following. Predicted is the tag sequence predicted by the model, and Correct is the correct tag sequence. For this sentence, the model's prediction matches the gold data.
In October last year, 34 people were killed in an explosion in Taba, Egypt, near the site.
Predicted: B-DAT I-DAT I-DAT O O O O O O O O O O O O B-LOC O B-LOC O O O O O O O O O O
Correct: B-DAT I-DAT I-DAT O O O O O O O O O O O O B-LOC O B-LOC O O O O O O O O O O
This completes the construction of the named entity extractor.
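Optionally, you can peek inside the model. The sketch below, which follows the pattern in the python-crfsuite tutorial, prints the strongest learned transition weights between labels:

```python
from collections import Counter

info = tagger.info()

# show the five label-to-label transitions with the largest weights
for (label_from, label_to), weight in Counter(info.transitions).most_common(5):
    print('%-6s -> %-7s %0.6f' % (label_from, label_to, weight))
```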
We created a model, but we don't yet know whether it is good or bad, so it is important to evaluate it. Let's evaluate the created model using precision, recall, and F1 score. Below is the evaluation code.
def bio_classification_report(y_true, y_pred):
    lb = LabelBinarizer()
    y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))
    y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))
    # evaluate all tags except 'O', grouped by named entity class
    tagset = set(lb.classes_) - {'O'}
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}
    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels=[class_indices[cls] for cls in tagset],
        target_names=tagset,
    )
Tag the sentences in the test set for use in evaluation.
y_pred = [tagger.tag(xseq) for xseq in X_test]
Pass the tags predicted by the trained model and the correct tags to the evaluation function and display the result. For each class, the precision, recall, F1 score, and support (number of tags) are shown.
>>> print(bio_classification_report(y_test, y_pred))
precision recall f1-score support
B-ART 1.00 0.89 0.94 9
I-ART 0.92 1.00 0.96 12
B-DAT 1.00 1.00 1.00 12
I-DAT 1.00 1.00 1.00 22
B-LOC 1.00 0.95 0.97 55
I-LOC 0.94 0.94 0.94 17
B-ORG 0.75 0.86 0.80 14
I-ORG 1.00 0.90 0.95 10
B-PSN 0.00 0.00 0.00 3
B-TIM 1.00 0.71 0.83 7
I-TIM 1.00 0.81 0.90 16
avg / total 0.95 0.91 0.93 177
The results look a little too good; the data probably contains similar sentences.
※Caution: You may get an UndefinedMetricWarning. Precision and related metrics cannot be defined for labels that never appear in the predicted samples, which happens here because the dataset is small.
This time, we were able to easily create a named entity extractor using the Python library crfsuite. The tagging follows the IREX definition with its eight classes of named entities. However, the IREX definitions are often too coarse for practical use, so if you want to use named entity recognition for a specific task, you need to prepare data with the tags that task requires.
You may also want to look for better features and model parameters.