[Natural language processing] Extract keywords from Kakenhi database with MeCab-ipadic-neologd and termextract

To all the researchers currently writing Grants-in-Aid for Scientific Research applications: thank you for your hard work. As you know, projects adopted in the past are listed in the Kakenhi Database, but reading through all of them is quite difficult. To get a rough idea of past trends, I tried extracting keywords from the research outlines in the Kakenhi database with natural language processing. I used the morphological analysis package MeCab and the terminology extraction tool termextract.

Environment

Use Python and Jupyter Notebook.

OS etc.

MeCab

Following the setup guide listed in the references, install MeCab and mecab-python3 for morphological analysis, and set neologd (mecab-ipadic-NEologd) as the standard dictionary. Once installed, try it from bash.

Standard dictionary ipadic (default for MeCab)

bash


echo "Eukaryote" | mecab
True prefix,Noun connection,*,*,*,*,true,Ma,Ma
Nuclear noun,General,*,*,*,*,Nuclear,write,write
Biological noun,General,*,*,*,*,Organism,Saves,Saves
EOS

With the default ipadic, "eukaryote" (真核生物) is not recognized as a single word; it is split into three morphemes.

bash


echo "Grant-in-Aid for Scientific Research" | mecab
Scientific nouns,General,*,*,*,*,Science,Science,Science
Research nouns,Change connection,*,*,*,*,the study,Kenkyu,Kenkyu
Expense noun,suffix,General,*,*,*,Expenses,Hi,Hi
Auxiliary noun,Change connection,*,*,*,*,auxiliary,Hojo,Hojo
Gold noun,suffix,General,*,*,*,Money,Kin,Kin
EOS

ipadic does not recognize "Grants-in-Aid for Scientific Research" (科学研究費補助金) as a single term either.

Standard dictionary neologd

bash


echo "Eukaryote" | mecab
Eukaryotic noun,Proper noun,General,*,*,*,Eukaryote,Shinkaku Saves,Shinkaku Saves
EOS

neologd recognizes "eukaryote" as one word! That makes keyword extraction look a little more promising.

bash


echo "Grant-in-Aid for Scientific Research" | mecab
Scientific nouns,General,*,*,*,*,Science,Science,Science
Research nouns,Change connection,*,*,*,*,the study,Kenkyu,Kenkyu
Expense noun,suffix,General,*,*,*,Expenses,Hi,Hi
Subsidy noun,Proper noun,General,*,*,*,Subsidy,Hojokin,Hojokin
EOS

"Grants-in-Aid for Scientific Research" does not seem to be recognized as one word.

mecab-python

Let's try MeCab from Python. For testing, I borrowed the first sentence of the data used below.

python


import MeCab

tagger = MeCab.Tagger("mecabrc")
print(tagger.parse("Eukaryotes can be broadly divided into Unikont and Bikont."))

Output result


Eukaryotic noun,Proper noun,General,*,*,*,Eukaryote,Shinkaku Saves,Shinkaku Saves
Is a particle,Particle,*,*,*,*,Is,C,Wa
Unikont noun,Proper noun,General,*,*,*,Unikont,Unikont,Unikont
And particles,Parallel particles,*,*,*,*,When,To,To
Bikont noun,Proper noun,General,*,*,*,Bikont,Bikont,Bikont
Particles,Case particles,General,*,*,*,To,D,D
Great noun,Change connection,*,*,*,*,Roughly divided,Taibetsu,Taibetsu
Verbs that can,Independence,*,*,One step,Uninflected word,it can,Dekill,Dekill
.. symbol,Punctuation,*,*,*,*,。,。,。
EOS

Morphological analysis also works from Python.
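
To make the output format concrete: each line MeCab prints is a surface form, a tab, and then comma-separated features. With ipadic-style dictionaries the first feature is the part of speech (labelled in Japanese, e.g. 名詞 = noun, 助詞 = particle) and the seventh is the base form. The sketch below parses such text with the standard library only. The sample string is hand-written to mimic that layout rather than captured from MeCab, and `parse_mecab_output` is a hypothetical helper name.

```python
def parse_mecab_output(text):
    """Split MeCab-style output into (surface, pos, base_form) tuples."""
    rows = []
    for line in text.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        surface, _, feature = line.partition("\t")
        fields = feature.split(",")
        pos = fields[0]
        # Fall back to the surface form when no base form is given ("*").
        base = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        rows.append((surface, pos, base))
    return rows


# Hand-written sample in MeCab's layout (an assumption, not real output).
sample = (
    "真核生物\t名詞,固有名詞,一般,*,*,*,真核生物,シンカクセイブツ,シンカクセイブツ\n"
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n"
    "EOS\n"
)

for row in parse_mecab_output(sample):
    print(row)
```

The function defined later in this article reads the same two positions (index 0 and index 6) directly from `node.feature`.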

termextract

termextract is a package that extracts technical terms. Its input must be in the form of MeCab analysis results. I installed it following the guide listed in the references.

Download csv data from Kakenhi database

Finally, let's handle the Kakenhi data. At first I planned to scrape the site with Python and was looking into things such as whether scraping is prohibited, but then I realized that the search results can be downloaded as CSV, so none of that was needed. I downloaded all the records found with the search word "Chlamydomonas". If you are not familiar with Chlamydomonas, please see the Chlamydomonas entry in the references.

Data reading and formatting with pandas

Read the data with pandas and check it. I forgot to specify the encoding, but I could read it without any error.

python


import pandas as pd
kaken = pd.read_csv('kaken.nii.ac.jp_2020-10-23_22-31-59.csv')

Check the first few rows with kaken.head(). There seem to be many NaN values. Then check the whole data frame with kaken.info().

Output result


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 528 entries, 0 to 527
Data columns (total 40 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
0 Research subject name 528 non-null    object 
1 Research subject name(English)        269 non-null    object 
2 Research subject/Region number 528 non-null    object 
3 Research period(year)         528 non-null    object 
4 Principal 471 non-null    object 
5 Research Coordinator 160 non-null    object 
6 Collaborative Researcher 31 non-null     object 
7 Collaborators 20 non-null     object 
8 Research Fellow 53 non-null     object 
9 Foreign Research Fellow 4 non-null      object 
10 Accepted Researcher 4 non-null      object 
11 Keywords 505 non-null    object 
12 Research fields 380 non-null    object 
13 Examination category 102 non-null    object 
14 Research items 528 non-null    object 
15 Research Institute 528 non-null    object 
16 Application category 212 non-null    object 
17 Total allocation amount 526 non-null    float64
18 Total allocation(Direct expenses)       526 non-null    float64
19 Total allocation(Indirect expenses)       249 non-null    float64
20 Allocation amount for each year 526 non-null    object 
21 Allocation amount for each year(Direct expenses)     526 non-null    object 
22 Allocation amount for each year(Indirect expenses)     526 non-null    object 
23 Achievement to date(Classification code)  46 non-null     float64
24 Achievement to date(Classification)     46 non-null     object 
25 Reason 46 non-null     object 
26 Outline of research at the beginning of research 14 non-null     object 
27 Research outline 323 non-null    object 
28 Research outline(English)         156 non-null    object 
29 Outline of research results 85 non-null     object 
30 Outline of research results(English)      85 non-null     object 
31 Outline of research results 84 non-null     object 
32 Achievement to date(Paragraph)     90 non-null     object 
33 Measures to promote future research 94 non-null     object 
34 Next year's research funding plan 0 non-null      float64
35 Reason for the amount used in the next fiscal year 0 non-null      float64
36 Usage plan for next year 0 non-null      float64
37 Free description field 0 non-null      float64
38 Evaluation symbol 3 non-null      object 
39 Remarks 0 non-null      float64
dtypes: float64(9), object(31)
memory usage: 165.1+ KB

Free text appears in four columns: "Outline of research at the beginning of research" (column 26), "Research outline" (column 27), and the two columns listed as "Outline of research results" (columns 29 and 31). There is also a "Keywords" column, but since the aim here is to extract keywords from the text, I ignore it. Probably because the items to be filled in have changed from year to year, there are many NaN values, and the rows that contain text do not line up across columns. I therefore extracted only the text from the data frame into a single list.

python


# Positions of the four free-text columns in the kaken.info() listing:
# 26 and 27 hold research outlines, 29 and 31 hold outlines of results.
text_column_positions = [26, 27, 29, 31]
abstracts = []

for position in text_column_positions:
    # Drop empty cells (NaN) and collect the remaining text.
    abstracts.extend(kaken.iloc[:, position].dropna().tolist())

The list is now ready for morphological analysis. Let's run the analysis on each element.
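
As a pandas-free illustration of the collection step above, the following sketch gathers the non-empty cells of several text columns into one flat list using only the standard library. The miniature CSV and its column names (`outline`, `results`) are made up for this example and do not come from the Kakenhi export.

```python
import csv
import io

# Hypothetical miniature export: two text columns with gaps, like the real file.
sample_csv = io.StringIO(
    "title,outline,results\n"
    "A,first outline,\n"
    "B,,first results\n"
    "C,second outline,second results\n"
)
rows = list(csv.DictReader(sample_csv))

text_columns = ["outline", "results"]  # assumed column names for this toy data
texts = []
for column in text_columns:
    # Keep non-empty cells only, column by column (mirrors dropna().tolist()).
    texts.extend(row[column] for row in rows if row[column])

print(texts)
```

Because the cells are collected column by column, the order of the list follows the columns, not the original rows, which is fine for counting words.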

Morphological analysis with MeCab

Following the article listed in the references, I defined a function that returns a list of words from MeCab's morphological analysis. By default it extracts only nouns, verbs, and adjectives, and verbs and adjectives are converted to their base (dictionary) form.

python


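tagger = MeCab.Tagger('')
# Parse an empty string once; a common workaround for node.surface
# coming back empty in some mecab-python3 versions.
tagger.parse('')

# The ipadic/neologd dictionaries label parts of speech in Japanese:
# 動詞 = verb, 形容詞 = adjective, 名詞 = noun.
def wakati_text(text, word_class=('動詞', '形容詞', '名詞')):
    """Return the words in text whose part of speech is in word_class."""
    node = tagger.parseToNode(text)
    terms = []

    while node:
        features = node.feature.split(',')
        pos = features[0]  # part of speech

        if pos in word_class:
            if pos == '名詞':
                terms.append(node.surface)  # nouns: keep the form used in the sentence
            else:
                terms.append(features[6])  # verbs and adjectives: use the base form

        node = node.next

    return terms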

Let's test it on part of the data extracted earlier. Only nouns, verbs, and adjectives are extracted. ("9 + 2 structure" is not extracted as a unit ...) Apply the function wakati_text to the whole list `abstracts` to get a list of nouns, verbs, and adjectives.

python


wakati_abstracts = []

for abstract in abstracts:
    wakati_abstracts.extend(wakati_text(abstract))

You now have a list of nouns, verbs, and adjectives.

Visualization

Count the elements of the list wakati_abstracts and draw a bar graph of the 50 most frequent words.

python


import collections
import matplotlib.pyplot as plt
import matplotlib as mpl

words, counts = zip(*collections.Counter(wakati_abstracts).most_common())

mpl.rcParams['font.family'] = 'Noto Sans JP Regular'
plt.figure(figsize=[12, 6])
plt.bar(words[0:50], counts[0:50])
plt.xticks(rotation =90)
plt.ylabel('freq')
plt.savefig('kaken_bar.png', dpi=200, bbox_inches="tight")

Figure: kaken_bar.png

Since stop words were not removed, words such as "do", "koto", "reru", "is", and "target" rank high. Besides the search word "Chlamydomonas", terms familiar to people who work on Chlamydomonas, such as "gene", "light", "cell", "flagella", "protein", and "dynein", appear. The result suggests that verbs and adjectives were not really needed.

Extraction of nouns only

I extracted only nouns with the same procedure as above. Just pass a noun-only list as the second argument of the function wakati_text.

python


noun_abstracts = []

for abstract in abstracts:
    # 名詞 = noun; MeCab's Japanese dictionaries emit Japanese part-of-speech labels
    noun_abstracts.extend(wakati_text(abstract, ['名詞']))

The intermediate code is the same as above, so I'll omit it and show only the resulting plot.

Figure: kaken_bar_noun.png

It bothers me that "koto" ranks first and that the numerals "1", "2", and "3" are included, but the result looks a little more like keywords than before.
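
One possible way to deal with the stop words and numerals mentioned above is to filter the token list before counting. This is only a sketch: the stop-word set below is a small assumed example (こと is the "koto" seen in the plot), `keep_token` is a hypothetical helper, and the token list is toy data rather than the real `noun_abstracts`.

```python
from collections import Counter

# Assumed stop-word list for illustration; tune it to your own data.
STOP_WORDS = {"こと", "ため", "これ", "もの"}


def keep_token(token):
    """Drop stop words and tokens that consist only of digits."""
    return token not in STOP_WORDS and not token.isdigit()


# Toy token list standing in for the output of wakati_text.
tokens = ["こと", "遺伝子", "1", "鞭毛", "遺伝子", "2", "こと", "細胞"]
filtered = [t for t in tokens if keep_token(t)]

print(Counter(filtered).most_common(2))
```

Passing a number to `most_common` also returns just the top entries directly, so the slicing used for the bar graph is not needed when only the top 50 are wanted.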

Terminology extraction using termextract

Next, let's use termextract to extract technical terms. I followed the morphological-analysis approach described in the reference article.

Data shaping

termextract takes MeCab's morphological analysis output as its input. Parse each element of the list `abstracts` with MeCab and join the results with line breaks.

python


# Build the input in MeCab output format
mecab_abstracts = []

for abstract in abstracts:
    mecab_abstracts.append(tagger.parse(abstract))

# Separate the parse results of each abstract with a newline
input_text = '\n'.join(mecab_abstracts)

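
To give a feel for why termextract wants MeCab-formatted text, here is a rough sketch of the kind of preprocessing it can do with it: treating each run of consecutive nouns as one compound noun and counting the runs. This is a simplification written for this article, not termextract's own code (the real cmp_noun_dict applies additional rules), and both `compound_nouns` and the sample lines are hypothetical.

```python
from collections import Counter


def compound_nouns(mecab_text):
    """Count runs of consecutive nouns in MeCab-formatted text."""
    counts = Counter()
    run = []
    for line in mecab_text.splitlines():
        surface, _, feature = line.partition("\t")
        if feature.split(",")[0] == "名詞":  # 名詞 = noun
            run.append(surface)
            continue
        # Any non-noun line (including EOS) ends the current run.
        if run:
            counts[" ".join(run)] += 1
            run = []
    if run:
        counts[" ".join(run)] += 1
    return counts


# Hand-written lines in MeCab's layout (an assumption, not real output).
sample = "\n".join([
    "鞭毛\t名詞,一般,*,*,*,*,鞭毛,ベンモウ,ベンモウ",
    "運動\t名詞,サ変接続,*,*,*,*,運動,ウンドウ,ウンドウ",
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ",
    "解析\t名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ",
    "EOS",
])

print(compound_nouns(sample))
```

Here the component nouns are joined with spaces purely as a convention for this sketch, so that a compound can be split back into its parts later.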

Analyze with term extract

The code is taken almost entirely from the reference article.

python


import termextract.mecab
import termextract.core

word_list = []
value_list = []

frequency = termextract.mecab.cmp_noun_dict(input_text)
LR = termextract.core.score_lr(frequency,
         ignore_words=termextract.mecab.IGNORE_WORDS,
         lr_mode=1, average_rate=1
     )
term_imp = termextract.core.term_importance(frequency, LR)

#Sort and output in descending order of importance
data_collection = collections.Counter(term_imp)
for cmp_noun, value in data_collection.most_common():
    word = termextract.core.modify_agglutinative_lang(cmp_noun)
    word_list.append(word)
    value_list.append(value)
    print(word, value, sep="\t")

I'm not sure exactly what the score means, but plausible terms are coming out. Let's visualize this as well.
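
On what the score means: as far as I understand from the termextract documentation, it is based on the FLR idea, where a compound noun's importance is its frequency multiplied by an LR value that reflects how often each component noun connects to other nouns on its left and right. The sketch below is a simplified, hand-written version of that idea for illustration only. It is not termextract's code, `flr_scores` is a hypothetical name, the input is toy data, and options such as lr_mode and average_rate are not modeled.

```python
from collections import Counter
from math import prod


def flr_scores(occurrences):
    """Score compound nouns as frequency x LR (simplified FLR sketch).

    occurrences: list of tuples of component nouns, one tuple per occurrence.
    """
    freq = Counter(occurrences)
    left = Counter()   # how often a noun has another noun on its left
    right = Counter()  # how often a noun has another noun on its right
    for compound, count in freq.items():
        for first, second in zip(compound, compound[1:]):
            right[first] += count
            left[second] += count

    scores = {}
    for compound, count in freq.items():
        # Geometric mean of (left + 1) * (right + 1) over the component nouns.
        lr = prod((left[n] + 1) * (right[n] + 1) for n in compound)
        lr **= 1 / (2 * len(compound))
        scores[compound] = count * lr
    return scores


# Toy occurrences, not taken from the Kakenhi data.
occurrences = [
    ("flagellar", "movement"),
    ("flagellar", "movement"),
    ("flagellar", "dynein"),
    ("dynein",),
]

for compound, score in sorted(flr_scores(occurrences).items(), key=lambda kv: -kv[1]):
    print(" ".join(compound), round(score, 3), sep="\t")
```

Under this reading, a term scores high when it is frequent and when its parts are productive in forming other compounds, which would explain why multi-word terms built from common components rank near the top.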

Visualization

The code is the same as above, so I'll omit it.

Figure: kaken_bar_termextract.png

More plausible terms such as "photosystem II", "transformant", "flagellar movement", and "gene group" are extracted. Whether it is acceptable that "Chlamydomonas" and "green alga Chlamydomonas", or "dynein" and "axoneme dynein", are counted as separate items is open to question.

Summary

I extracted keywords from search results of the Kakenhi database. Compared with morphological analysis by MeCab alone, termextract extracted words that look more like keywords.

Bonus: GiNZA

I also tried named entity recognition with GiNZA.

python


import spacy
from spacy import displacy

nlp = spacy.load('ja_ginza')
doc = nlp(abstracts[0]) 

#Drawing the result of named entity extraction
displacy.render(doc, style="ent", jupyter=True)


These are not named entities, so it can't be helped, but the expressions I wanted, such as "Unikont", "Bikont", "cilia", and "Chlamydomonas", are not picked up. And, as expected, "9 + 2 structure" is not extracted either.

References

- Prepare an environment where MeCab can be used on Mac
- Extract only words with specific part of speech in Python and Mecab
- Easy keyword extraction with TermExtract for Python
- I tried to extract named entities with the natural language processing library GiNZA
- Biological Exercise Machinery Picture Book Chlamydomonas (Swimming Exercise)
- 9 + 2 structure from ancient times, the mystery of cilia
