Language processing 100 knocks-22: Extraction of category names

This is a record of the 22nd task, "Extraction of category names," from Language Processing 100 Knocks 2015, "Chapter 3: Regular Expressions." This time we use a **non-capturing group and a non-greedy match**. The nice thing about these 100 knocks is that you can learn the material a little at a time.

Reference links

| Link | Remarks |
|------|---------|
| 022. Extraction of category names.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks: 22 | Source from which I copied and pasted many parts |
| Python regular expression basics and tips to learn from scratch | A write-up of what I learned through this knock |
| Regular Expression HOWTO | Python's official regular expression how-to |
| re --- Regular expression operations | Python's official documentation of the re package |
| Help: Quick reference chart | Quick reference for Wikipedia's typical markup |

environment

| type | version | Contents |
|------|---------|----------|
| OS | Ubuntu18.04.01 LTS | Running virtually |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | I use python 3.6.9 on pyenv; there is no deep reason not to use the 3.7 or 3.8 series; packages are managed with venv |

In the above environment, I use the following additional Python package. Just install it with regular pip.

| type | version |
|------|---------|
| pandas | 0.25.3 |

Chapter 3: Regular Expressions

content of study

By applying regular expressions to the markup description on Wikipedia pages, various information and knowledge can be extracted.

Regular Expressions, JSON, Wikipedia, InfoBox, Web Services

Knock content

There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format.

- One article's information per line, stored in JSON format
- In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out as JSON
- The entire file is gzip-compressed

Create a program that performs the following processing.
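
As a reference, here is a minimal sketch of reading this file format with only the standard library (gzip and json). The title value 'England' follows the translated article name used in the answer below and is an assumption; the answer program itself reads an already-decompressed copy with pandas.

import gzip
import json

# Return the body text of the article whose "title" matches, or None if not found.
def load_article(path, title):
    with gzip.open(path, mode='rt', encoding='utf-8') as f:
        for line in f:
            article = json.loads(line)  # one JSON object per line
            if article['title'] == title:
                return article['text']
    return None

# Hypothetical usage; the answer below uses pandas on the decompressed file instead.
wiki_body = load_article('jawiki-country.json.gz', 'England')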

22. Extraction of category name

Extract the article category names (by name, not line by line).

Problem supplement (about "category")

According to Help: Quick reference chart, a "category" has the format [[Category:Help|Sort key]], where the part after | is an optional sort key. Extract the "Help" part of this format. In the file, the "category" parts are the following data.

Excerpt from the "category" part of the file


[[Category:England|*]]\n'
[[Category:Commonwealth Kingdom|*]]\n'
[[Category:G8 member countries]]\n'
[[Category:European Union member states]]\n'
[[Category:Maritime nation]]\n'
[[Category:Sovereign country]]\n'
[[Category:Island country|Kureito Furiten]]\n'
[[Category:States / Regions Established in 1801]]'
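
To make the goal concrete, here is a small sketch (not the answer program) that pulls the names out of the excerpt above. The variable sample is hypothetical test data re-typed from the excerpt, and the character class [^|\]] is an alternative to the non-greedy match used in the answer.

import re

# Hypothetical sample re-typed from the excerpt above.
sample = '''[[Category:England|*]]
[[Category:Island country|Kureito Furiten]]
[[Category:Sovereign country]]'''

# The character class excludes '|' and ']', so the match stops before the sort key.
print(re.findall(r'\[\[Category:([^|\]]+)', sample))
# ['England', 'Island country', 'Sovereign country']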

Answer

Answer program [022. Category name extraction.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/03.%E6%AD%A3%E8%A6%8F%E8%A1%A8%E7%8F%BE/022.%E3%82%AB%E3%83%86%E3%82%B4%E3%83%AA%E5%90%8D%E3%81%AE%E6%8A%BD%E5%87%BA.ipynb)

import re

import pandas as pd

# Load the JSON Lines file with pandas and return the body text of the article with the given title
def extract_by_title(title):
    df_wiki = pd.read_json('jawiki-country.json', lines=True)
    return df_wiki[(df_wiki['title'] == title)]['text'].values[0]

wiki_body = extract_by_title('England')

# A raw string (the r prefix) keeps backslashes from being interpreted as escape sequences
# Triple quotes let the pattern span multiple lines
# The re.VERBOSE option ignores whitespace in the pattern and allows comments
# The re.MULTILINE option makes ^ and $ match at the start and end of every line
# A non-greedy match keeps the matched string as short as possible
print(re.findall(r'''
                  ^                  # Start of the line (the result does not change without it, but it is included)
                  \[\[Category:      # Search string (the \ escapes the brackets)
                  (                  # Start of the capturing group
                  .*?                # Non-greedy match of zero or more arbitrary characters
                  )                  # End of the capturing group
                  (?:                # Start of the non-capturing group
                  \|                 # Search string '|'
                  .*                 # Zero or more arbitrary characters
                  )?                 # End of the non-capturing group (it appears 0 or 1 times)
                  \]\]               # Search string (the \ escapes the brackets)
                  $                  # End of the line (the result does not change without it, but it is included)
                  ''', wiki_body, re.MULTILINE+re.VERBOSE))
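
As a side note (my sketch, not part of the original answer): the two flags are bit flags, so they can also be combined with the bitwise OR operator |, and the pattern can be precompiled. Once the comments inside the pattern are dropped, re.VERBOSE is no longer needed.

# Assumed variant: the same extraction with a precompiled, single-line pattern.
CATEGORY_PATTERN = re.compile(r'^\[\[Category:(.*?)(?:\|.*)?\]\]$', re.MULTILINE)
print(CATEGORY_PATTERN.findall(wiki_body))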

Answer commentary

The main part this time is the following.

python


print(re.findall(r'''
                  ^                  # Start of the line (the result does not change without it, but it is included)
                  \[\[Category:      # Search string (the \ escapes the brackets)
                  (                  # Start of the capturing group
                  .*?                # Non-greedy match of zero or more arbitrary characters
                  )                  # End of the capturing group
                  (?:                # Start of the non-capturing group
                  \|                 # Search string '|'
                  .*                 # Zero or more arbitrary characters
                  )?                 # End of the non-capturing group (it appears 0 or 1 times)
                  \]\]               # Search string (the \ escapes the brackets)
                  $                  # End of the line (the result does not change without it, but it is included)
                  ''', wiki_body, re.MULTILINE+re.VERBOSE))

[Non-capturing group (?:...)](https://qiita.com/FukuharaYohei/items/459f27f0d7bbba551af7#%E3%82%AD%E3%83%A3%E3%83%97%E3%83%81%E3%83%A3%E5%AF%BE%E8%B1%A1%E5%A4%96)

Writing a group as (?: ...) means the group is matched but **not captured**, so its text is not included in the returned results. This time, in the [[Category:Help|Sort key]] format, I do not want to capture the |Sort key part, so I put it in a non-capturing group. In the example below, the 4 is part of the regular expression pattern, but it does not appear in the output.

>>> re.findall(r'(.012)(?:4)', 'A0123 B0124 C0123')
['B012']
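
A related point, shown with a hypothetical category line: if the second group were capturing instead of non-capturing, re.findall would return a tuple per match, so the non-capturing group also keeps the result a plain list of names.

>>> line = '[[Category:Island country|Kureito Furiten]]'   # hypothetical sample line
>>> re.findall(r'\[\[Category:(.*?)(?:\|.*)?\]\]', line)    # non-capturing group
['Island country']
>>> re.findall(r'\[\[Category:(.*?)(\|.*)?\]\]', line)      # capturing group
[('Island country', '|Kureito Furiten')]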

Greedy / non-greedy match

**You can control how long the matched string is.** A greedy match matches the longest possible string, and a non-greedy match matches the shortest possible string; the default is greedy. This time, in the [[Category:Help|Sort key]] format, the |Sort key part appears zero or one time, so if the capture were greedy it would also swallow the |Sort key part whenever it is present.

# Greedy match
>>> print(re.findall(r'.0.*2',  'A0123 B0123'))
['A0123 B012']

# Non-greedy match (a ? added after the *)
>>> print(re.findall(r'.0.*?2', 'A0123 B0123'))
['A012', 'B012']
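
Applied to a hypothetical category line with the pattern from the answer, the difference looks like this; only the non-greedy version stops before the sort key.

>>> line = '[[Category:Island country|Kureito Furiten]]'    # hypothetical sample line
>>> print(re.findall(r'\[\[Category:(.*)(?:\|.*)?\]\]', line))   # greedy
['Island country|Kureito Furiten']
>>> print(re.findall(r'\[\[Category:(.*?)(?:\|.*)?\]\]', line))  # non-greedy
['Island country']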

Output result (execution result)

When the program is executed, the following results will be output.

Output result


['England', 'Commonwealth Kingdom', 'G8 member countries', 'European Union member states', 'Maritime nation', 'Sovereign country', 'Island country', 'States / Regions Established in 1801']
