**Named entity recognition** is a technology that extracts proper nouns such as **personal names** and **place names**, and numerical expressions such as **dates** and **times**, from text. Named entity recognition is also used as an elemental technology in natural language processing applications such as **question answering systems**, **dialogue systems**, and **information extraction**.
This time, I will build a named entity extractor using **machine learning**.
※Note: No theory is covered here. If you want to know the theory, please look elsewhere.
This section gives an overview of named entity recognition and the method used to perform it.
Named entity extraction is a technology that extracts proper nouns such as personal names and place names, and numerical expressions such as dates and times, from text. Let's look at a concrete example and extract the named entities from the following sentence.
Taro went to see Hanako at 9 am on May 18th.
Extracting the named entities contained in the above sentence yields **Taro** and **Hanako** as **personal names**, **May 18** as a **date**, and **9 am** as a **time**.
In the above example, personal name, date, and time were extracted as named entity classes. In general, the following eight classes, from the Information Retrieval and Extraction Exercise (IREX) [named entity extraction task definition](https://nlp.cs.nyu.edu/irex/NE/), are often used.
Class | Example |
---|---|
ART (unique product name) | Nobel Prize in Literature, Windows 7 |
LOC (place name) | Chiba, USA |
ORG (organization) | Liberal Democratic Party, NHK |
PSN (personal name) | Shinzo Abe, Merkel |
DAT (date) | January 29, 2016/01/29 |
TIM (time) | 3 pm, 10:30 |
MNY (amount of money) | 241 yen, $8 |
PNT (percentage) | 10%, 30% |
One way to extract named entities is to label sentences that have been morphologically analyzed. The following is an example of labeling the sentence "Taro went to see Hanako at 9 am on May 18th." after morphological analysis.
The labels B-XXX and I-XXX indicate that the labeled strings are named entities. B-XXX marks the beginning of a named entity string, and I-XXX marks its continuation. The XXX part holds a named entity class such as ORG or PSN. Tokens that are not part of any named entity are labeled O.
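As a rough sketch of the scheme (using a hypothetical word-level tokenization of the example sentence; the actual corpus uses MeCab morphemes with part-of-speech columns, as shown later):

```python
# Hypothetical IOB2 labeling of the example sentence
labeled = [
    ('Taro', 'B-PSN'), ('went', 'O'), ('to', 'O'), ('see', 'O'),
    ('Hanako', 'B-PSN'), ('at', 'O'),
    ('9', 'B-TIM'), ('am', 'I-TIM'), ('on', 'O'),
    ('May', 'B-DAT'), ('18th', 'I-DAT'), ('.', 'O'),
]
```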
Labeling can be done with hand-written rules, but this time it will be done with **machine learning**. That is, we create a model from pre-labeled training data and use that model to label unlabeled sentences. Specifically, we train with an algorithm called CRF (conditional random fields).
Now let's get hands-on.
Start by installing the required Python modules. Execute the following commands in a terminal. We use CRFsuite as the CRF library, via the python-crfsuite bindings.
pip install numpy
pip install scipy
pip install scikit-learn
pip install python-crfsuite
Once installed, import the required modules. Execute the following code.
from itertools import chain
import pycrfsuite
import sklearn
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelBinarizer
Since CRF is a supervised learning method, we need labeled training data. This time, I prepared the tagged data in advance. Please download it from here. The file name is "hironsan.txt".
Now, let's first define a class to read the downloaded data.
import codecs

class CorpusReader(object):

    def __init__(self, path):
        with codecs.open(path, encoding='utf-8') as f:
            sent = []
            sents = []
            for line in f:
                # a blank line separates sentences
                if line == '\n':
                    sents.append(sent)
                    sent = []
                    continue
                # each line holds tab-separated morpheme information
                morph_info = line.strip().split('\t')
                sent.append(morph_info)
            # keep the last sentence if the file has no trailing blank line
            if sent:
                sents.append(sent)
        # hold out the last 10% of sentences as test data
        train_num = int(len(sents) * 0.9)
        self.__train_sents = sents[:train_num]
        self.__test_sents = sents[train_num:]

    def iob_sents(self, name):
        if name == 'train':
            return self.__train_sents
        elif name == 'test':
            return self.__test_sents
        else:
            return None
Next, load the downloaded data using the class we just created. There are 450 training sentences and 50 test sentences.
c = CorpusReader('hironsan.txt')
train_sents = c.iob_sents('train')
test_sents = c.iob_sents('test')
The loaded data has the following format. The IOB2 tags were attached after morphological analysis with the morphological analyzer MeCab. The data is divided into sentences, and each sentence is a list of morpheme information.
>>> train_sents[0]
[['2005', 'noun', 'number', '*', '*', '*', '*', '*', 'B-DAT'],
['Year', 'noun', 'suffix', 'Classifier', '*', '*', '*', 'Year', 'Nen', 'Nen', 'I-DAT'],
['7', 'noun', 'number', '*', '*', '*', '*', '*', 'I-DAT'],
['Month', 'noun', 'General', '*', '*', '*', '*', 'Month', 'Moon', 'Moon', 'I-DAT'],
['14', 'noun', 'number', '*', '*', '*', '*', '*', 'I-DAT'],
['Day', 'noun', 'suffix', 'Classifier', '*', '*', '*', 'Day', 'Nichi', 'Nichi', 'I-DAT'],
['、', 'symbol', 'Comma', '*', '*', '*', '*', '、', '、', '、', 'O'],
...
]
Next, let's go over the features used for named entity extraction and then code them. This time we use, within a window of two tokens before and after the current word: the words themselves, their part-of-speech subtypes, their character types, and the named entity tags of the preceding tokens.
The classification of character types is as follows. There are 7 types in all.
Character type tag | Description |
---|---|
ZSPACE | Blank |
ZDIGIT | Arabic numerals |
ZLLET | Lowercase letters |
ZULET | Uppercase letters |
HIRAG | Hiragana |
KATAK | Katakana |
OTHER | Other |
The character type used as a feature is the combination of all the character types contained in a word. For example, the Japanese word "多い" ("many") contains both kanji and hiragana. The hiragana character type tag is HIRAG, and kanji falls under OTHER, so the character type of "多い" is "HIRAG-OTHER".
The code for determining the character type is as follows. All character types contained in the string are joined with a hyphen (-).
def is_hiragana(ch):
    # Unicode hiragana block
    return 0x3040 <= ord(ch) <= 0x309F


def is_katakana(ch):
    # Unicode katakana block
    return 0x30A0 <= ord(ch) <= 0x30FF


def get_character_type(ch):
    if ch.isspace():
        return 'ZSPACE'
    elif ch.isdigit():
        return 'ZDIGIT'
    elif ch.islower():
        return 'ZLLET'
    elif ch.isupper():
        return 'ZULET'
    elif is_hiragana(ch):
        return 'HIRAG'
    elif is_katakana(ch):
        return 'KATAK'
    else:
        return 'OTHER'


def get_character_types(string):
    # deduplicate, sort, and join the character types of all characters
    character_types = map(get_character_type, string)
    character_types_str = '-'.join(sorted(set(character_types)))
    return character_types_str
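A quick sanity check on a few hypothetical inputs:

```python
print(get_character_types('2005'))     # ZDIGIT
print(get_character_types('Windows'))  # ZLLET-ZULET
print(get_character_types('多い'))      # HIRAG-OTHER
```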
The code to extract the part of speech subclassification from the morpheme information is as follows.
def extract_pos_with_subtype(morph):
    # join the part of speech and its subtypes, stopping at the first '*' placeholder
    idx = morph.index('*')
    return '-'.join(morph[1:idx])
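For example, applied to the morpheme row for "2005" shown earlier:

```python
morph = ['2005', 'noun', 'number', '*', '*', '*', '*', '*', 'B-DAT']
print(extract_pos_with_subtype(morph))  # noun-number
```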
Based on the above, the code to extract the features for each word is as follows. It is a bit verbose, but easy to follow.
def word2features(sent, i):
    word = sent[i][0]
    chtype = get_character_types(sent[i][0])
    postag = extract_pos_with_subtype(sent[i])
    # features of the current word
    features = [
        'bias',
        'word=' + word,
        'type=' + chtype,
        'postag=' + postag,
    ]
    # features of the word two positions back
    if i >= 2:
        word2 = sent[i-2][0]
        chtype2 = get_character_types(sent[i-2][0])
        postag2 = extract_pos_with_subtype(sent[i-2])
        iobtag2 = sent[i-2][-1]
        features.extend([
            '-2:word=' + word2,
            '-2:type=' + chtype2,
            '-2:postag=' + postag2,
            '-2:iobtag=' + iobtag2,
        ])
    else:
        features.append('BOS')
    # features of the previous word
    if i >= 1:
        word1 = sent[i-1][0]
        chtype1 = get_character_types(sent[i-1][0])
        postag1 = extract_pos_with_subtype(sent[i-1])
        iobtag1 = sent[i-1][-1]
        features.extend([
            '-1:word=' + word1,
            '-1:type=' + chtype1,
            '-1:postag=' + postag1,
            '-1:iobtag=' + iobtag1,
        ])
    else:
        features.append('BOS')
    # features of the next word
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        chtype1 = get_character_types(sent[i+1][0])
        postag1 = extract_pos_with_subtype(sent[i+1])
        features.extend([
            '+1:word=' + word1,
            '+1:type=' + chtype1,
            '+1:postag=' + postag1,
        ])
    else:
        features.append('EOS')
    # features of the word two positions ahead
    if i < len(sent)-2:
        word2 = sent[i+2][0]
        chtype2 = get_character_types(sent[i+2][0])
        postag2 = extract_pos_with_subtype(sent[i+2])
        features.extend([
            '+2:word=' + word2,
            '+2:type=' + chtype2,
            '+2:postag=' + postag2,
        ])
    else:
        features.append('EOS')
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    return [morph[-1] for morph in sent]


def sent2tokens(sent):
    return [morph[0] for morph in sent]
Extract features from sentences with sent2features. The features that are actually extracted are as follows.
>>> sent2features(train_sents[0])[0]
['bias',
'word=2005',
'type=ZDIGIT',
'postag=noun-number',
'BOS',
'BOS',
'+1:word=Year',
'+1:type=OTHER',
'+1:postag=noun-suffix-Classifier',
'+2:word=7',
'+2:type=ZDIGIT',
'+2:postag=noun-number']
We can see that features can be extracted from the data. Now extract the features and labels of the training and test sets for later use.
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]
X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
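As a quick check, the sizes should match the 450/50 split mentioned earlier:

```python
print(len(X_train), len(X_test))  # expect: 450 50
```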
To train the model, create a pycrfsuite.Trainer object, load the training data, and then call the train method. First, create a Trainer object and read the training data.
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)
Next, set the training parameters. Ideally, these would be tuned on development data, but this time they are fixed.
trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier
    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})
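If you are curious which parameters the current training algorithm (L-BFGS by default) accepts, python-crfsuite can list their names:

```python
print(trainer.params())  # names of all tunable parameters
```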
Now that we're ready, let's train the model. Specify the file name and execute the train method.
trainer.train('model.crfsuite')
When execution finishes, a file with the specified name is created; the trained model is stored in it.
To use the trained model, create a pycrfsuite.Tagger object, load the trained model, and use the tag method. First, create a Tagger object and load the trained model.
tagger = pycrfsuite.Tagger()
tagger.open('model.crfsuite')
Now, let's actually tag the sentence.
example_sent = test_sents[0]
print(' '.join(sent2tokens(example_sent)))
print("Predicted:", ' '.join(tagger.tag(sent2features(example_sent))))
print("Correct: ", ' '.join(sent2labels(example_sent)))
You should get a result like the following. Predicted is the tag sequence predicted by the model, and Correct is the correct tag sequence. For this sentence, the model's prediction matches the gold data.
In October last year, 34 people were killed in an explosion in Taba, Egypt, near the site.
Predicted: B-DAT I-DAT I-DAT O O O O O O O O O O O O B-LOC O B-LOC O O O O O O O O O O
Correct: B-DAT I-DAT I-DAT O O O O O O O O O O O O B-LOC O B-LOC O O O O O O O O O O
This completes the construction of the named entity extractor.
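Optionally, you can peek inside the model. The sketch below, which follows the pattern in the python-crfsuite tutorial, prints the strongest learned transition weights between labels:

```python
from collections import Counter

info = tagger.info()

# show the five label-to-label transitions with the largest weights
for (label_from, label_to), weight in Counter(info.transitions).most_common(5):
    print('%-6s -> %-7s %0.6f' % (label_from, label_to, weight))
```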
We created a model, but we don't yet know whether it is good or bad, so it is important to evaluate it. Let's evaluate the created model using precision, recall, and F1 score. Below is the evaluation code.
def bio_classification_report(y_true, y_pred):
    lb = LabelBinarizer()
    y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))
    y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))
    # evaluate all tags except 'O', grouped by named entity class
    tagset = set(lb.classes_) - {'O'}
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}
    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels=[class_indices[cls] for cls in tagset],
        target_names=tagset,
    )
Tag the sentences in the test set for use in evaluation.
y_pred = [tagger.tag(xseq) for xseq in X_test]
Pass the tags predicted by the trained model and the correct tags to the evaluation function and display the result. For each class, the precision, recall, F1 score, and support (number of tags) are shown.
>>> print(bio_classification_report(y_test, y_pred))
precision recall f1-score support
B-ART 1.00 0.89 0.94 9
I-ART 0.92 1.00 0.96 12
B-DAT 1.00 1.00 1.00 12
I-DAT 1.00 1.00 1.00 22
B-LOC 1.00 0.95 0.97 55
I-LOC 0.94 0.94 0.94 17
B-ORG 0.75 0.86 0.80 14
I-ORG 1.00 0.90 0.95 10
B-PSN 0.00 0.00 0.00 3
B-TIM 1.00 0.71 0.83 7
I-TIM 1.00 0.81 0.90 16
avg / total 0.95 0.91 0.93 177
The results look a little too good; the data probably contains similar sentences.
※Caution: You may get an UndefinedMetricWarning. Precision and related metrics cannot be defined for labels that never appear in the predicted samples, which happens here because the dataset is small.
This time, we were able to easily create a named entity extractor using the Python library crfsuite. The tagging follows the IREX definition with its eight classes of named entities. However, the IREX definitions are often too coarse for practical use, so if you want to use named entity recognition for a specific task, you need to prepare data with the tags that task requires.
You may also want to look for better features and model parameters.