As of August 18, 2020, I could not find any article about applying SIFRank to Japanese documents, so here I write up the steps I took to actually extract key phrases. There are probably some rough edges, so I would appreciate any feedback.
The paper that proposed SIFRank and the original repository are here: • SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-trained Language Model • sunyilgdx/SIFRank
The code used this time is stored in the following repository. • tanajp/SIFRank_ja_model
The environment is as follows: • Google Colaboratory • Python 3.6.9 • allennlp 0.8.4 • nltk 3.4.3 • torch 1.2.0 • stanza 1.0.0
First, clone the repository above to your local machine. Then, download the Japanese version of ELMo from the AllenNLP site and place it under auxiliary_data in the SIFRank_ja_model folder. (Only the weights file needs to be downloaded here.)
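For reference, the clone command looks like this (a minimal sketch; the repository URL is inferred from the repository name above, and the weights file name follows the layout used later in this article):
git clone https://github.com/tanajp/SIFRank_ja_model.git
# Place the downloaded Japanese ELMo weights at:
#   SIFRank_ja_model/auxiliary_data/weights.hdf5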
This time we will work with the folder placed under My Drive, so put the cloned folder under My Drive in Google Drive. Next, configure Google Colab: select "Change runtime type" from the "Runtime" menu at the top left, set the hardware accelerator to GPU, and save.
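To check that the GPU runtime is actually enabled, you can run the following (optional; it simply queries PyTorch):
import torch
print(torch.cuda.is_available())  # should print True on a GPU runtime
Next, mount Google Drive.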
from google.colab import drive
drive.mount('/content/drive')
If the output is as follows, it is successful.
Enter your authorization code:
··········
Mounted at /content/drive
Install the required libraries.
!pip install -r '/content/drive/My Drive/SIFRank_ja_model/requirements.txt'
Download wordnet and the Japanese model of stanza.
import nltk
import stanza
nltk.download('wordnet')
stanza.download('ja')
test.py
import sys
sys.path.append('/content/drive/My Drive/SIFRank_ja_model')
sys.path.append('/content/drive/My Drive/SIFRank_ja_model/embeddings')
import stanza
import sent_emb_sif, word_emb_elmo
from model.method import SIFRank, SIFRank_plus
#download from https://allennlp.org/elmo
options_file = "https://exawizardsallenlp.blob.core.windows.net/data/options.json"
weight_file = "/content/drive/My Drive/SIFRank_ja_model/auxiliary_data/weights.hdf5"
ELMO = word_emb_elmo.WordEmbeddings(options_file, weight_file, cuda_device=0)  # Japanese ELMo on GPU 0
SIF = sent_emb_sif.SentEmbeddings(ELMO, lamda=1.0)  # SIF-weighted sentence embeddings
ja_model = stanza.Pipeline(lang="ja", processors={}, use_gpu=True)  # Japanese analysis pipeline (tokenize, pos, lemma, depparse)
elmo_layers_weight = [0.0, 1.0, 0.0]  # weights for the three ELMo layers
text = "Please enter the text here."
keyphrases = SIFRank(text, SIF, ja_model, N=5, elmo_layers_weight=elmo_layers_weight)
keyphrases_ = SIFRank_plus(text, SIF, ja_model, N=5, elmo_layers_weight=elmo_layers_weight)
print(keyphrases)
print(keyphrases_)
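If you save the above as test.py under the SIFRank_ja_model folder, it can be run from a Colab cell like this (you can also simply paste the code into a cell and run it directly):
!python '/content/drive/My Drive/SIFRank_ja_model/test.py'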
As an example, I extracted key phrases from the text of [ANA, 500 billion yen-scale capital raising talks - Wikinews](https://ja.wikinews.org/wiki/ANA%E3%80%815000%E5%84%84%E5%86%86%E8%A6%8F%E6%A8%A1%E3%81%AE%E8%B3%87%E6%9C%AC%E8%AA%BF%E9%81%94%E5%8D%94%E8%AD%B0).
2020-08-17 17:21:13 INFO: Loading these models for language: ja (Japanese):
=======================
| Processor | Package |
-----------------------
| tokenize | gsd |
| pos | gsd |
| lemma | gsd |
| depparse | gsd |
=======================
2020-08-17 17:21:13 INFO: Use device: gpu
2020-08-17 17:21:13 INFO: Loading: tokenize
2020-08-17 17:21:13 INFO: Loading: pos
2020-08-17 17:21:14 INFO: Loading: lemma
2020-08-17 17:21:14 INFO: Loading: depparse
2020-08-17 17:21:15 INFO: Done loading processors!
(['Development Bank of Japan', 'Capital raising', 'ana holdings', 'Loan', 'Private financial institution'], [0.8466373488741734, 0.8303728302151282, 0.7858931046897192, 0.7837600983935882, 0.7821878670623081])
(['Development Bank of Japan', 'Nihon Keizai Shimbun', 'All Nippon Airways', 'Capital raising', 'ana holdings'], [0.8480482653338678, 0.8232344465718657, 0.8218706097094447, 0.8100789955114978, 0.8053839380458278])
This is the result of extraction with N = 5. The final output is the key phrases and their scores; the top line is the output of SIFRank and the bottom line is the output of SIFRank+. Since the article is about ANA starting financing talks with the Development Bank of Japan and private financial institutions, the key phrase extraction appears to have worked well.
Incidentally, tokenize, pos, lemma, and depparse refer to tokenization, POS tagging, lemmatization, and dependency parsing, respectively; they are the processors in stanza's pipeline.
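For example, the output of each processor can be inspected like this (a minimal sketch; the sample sentence is arbitrary):
doc = ja_model("全日本空輸は日本政策投資銀行と協議を始めた。")
for sent in doc.sentences:
    for word in sent.words:
        # surface form, lemma, POS tag, and dependency relation of each token
        print(word.text, word.lemma, word.upos, word.deprel)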
I made SIFRank applicable to Japanese documents and actually used it. Any parser with a Japanese model can be used, but this time I used stanza. In addition, Slothlib is used as the Japanese stopword dictionary. You can edit the stopwords by rewriting japanese_stopwords.txt under auxiliary_data in the SIFRank_ja_model folder.
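For example, a stop word can be added like this (a minimal sketch; the word appended here is only an illustration):
stopwords_path = "/content/drive/My Drive/SIFRank_ja_model/auxiliary_data/japanese_stopwords.txt"
with open(stopwords_path, "a", encoding="utf-8") as f:
    f.write("\nこと")  # words listed here are excluded from key phrase candidates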
• SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-trained Language Model • sunyilgdx/SIFRank • AllenNLP • Introduction of the ELMo (using MeCab) model trained on a large-scale Japanese business news corpus • Usage and accuracy comparison of the ELMo (using MeCab) model trained on a large-scale Japanese business news corpus • Loading a trained ELMo model with AllenNLP - Rittanzu! • Slothlib