This is a record of knock 22, "Extract category names," from "Chapter 3: Regular expressions" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch3). This time I use **non-capturing groups and non-greedy matching**. The good thing about these 100 knocks is that you can learn the material little by little.
Link | Remarks |
---|---|
022.Extraction of category name.ipynb | GitHub link to the answer program |
100 amateur language processing knocks:22 | Source from which I copied and pasted many parts of the code |
Python regular expression basics and tips to learn from scratch | Where I organized what I learned in this knock |
Regular expression HOWTO | Official Python regular expression HOWTO |
re --- Regular expression operations | Official Python description of the re package |
Help:Quick Reference | Quick reference chart of typical Wikipedia markup |
type | version | Contents |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | I use Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |
In the above environment, I am using the following additional Python packages. Just install with regular pip.
type | version |
---|---|
pandas | 0.25.3 |
By applying regular expressions to the markup description on Wikipedia pages, various information and knowledge can be extracted.
Regular Expressions, JSON, Wikipedia, InfoBox, Web Services
There is a file, jawiki-country.json.gz, that exports Wikipedia articles in the following format.
- Each line stores the information for one article in JSON format
- In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format
- The entire file is gzip-compressed
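As a minimal sketch of how this format can be read with only the standard library (the answer program in this article uses pandas instead), the following writes a tiny stand-in file and reads it back line by line. The helper name `find_article_text` and the sample file are hypothetical and are not part of the original program.

```python
import gzip
import json
import os
import tempfile


def find_article_text(path, title):
    """Return the 'text' of the first article whose 'title' matches, or None."""
    # 'rt' opens the gzip stream in text mode, so each line is a str
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            article = json.loads(line)  # one JSON object per line
            if article['title'] == title:
                return article['text']
    return None


# Tiny stand-in file (hypothetical data) so the sketch runs
# without the real jawiki-country.json.gz
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.json.gz')
with gzip.open(sample_path, 'wt', encoding='utf-8') as f:
    f.write(json.dumps({'title': 'England', 'text': '[[Category:England|*]]'}) + '\n')
    f.write(json.dumps({'title': 'Japan', 'text': '[[Category:Japan]]'}) + '\n')

sample_text = find_article_text(sample_path, 'England')
print(sample_text)
```

Reading line by line avoids loading every article into memory when only one title is needed.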
Create a program that performs the following processing.
Extract the article category names (by name, not line by line).
According to Help:Quick Reference, a category is written in the format `[[Category:Help|Hayami Hiyo]]`. Extract the `Help` part of this format.
In the file, the "category" part is the following data.
Excerpt from the "category" part of the file
```
[[Category:England|*]]
[[Category:Commonwealth Kingdom|*]]
[[Category:G8 member countries]]
[[Category:European Union member states]]
[[Category:Maritime nation]]
[[Category:Sovereign country]]
[[Category:Island country|Kureito Furiten]]
[[Category:States / Regions Established in 1801]]
```
```python
import re

import pandas as pd


def extract_by_title(title):
    # Read the file as JSON Lines (one article per line)
    df_wiki = pd.read_json('jawiki-country.json', lines=True)
    return df_wiki[(df_wiki['title'] == title)]['text'].values[0]


wiki_body = extract_by_title('England')

# The r prefix makes a raw string, so escape sequences are not interpreted
# Triple quotes let the pattern span multiple lines
# re.VERBOSE ignores whitespace and comments inside the pattern
# re.MULTILINE makes ^ and $ match at the start and end of every line
# The non-greedy match keeps the captured string as short as possible
print(re.findall(r'''
    ^             # Start of line (the result is the same without it, but I include it)
    \[\[Category: # Search term (\ escapes the brackets)
    (             # Start of capturing group
    .*?           # Non-greedy match of zero or more arbitrary characters
    )             # End of capturing group
    (?:           # Start of non-capturing group
    \|            # Search term '|'
    .*            # Zero or more arbitrary characters
    )?            # End of non-capturing group (appears zero or one time)
    \]\]          # Search term (\ escapes the brackets)
    $             # End of line (the result is the same without it, but I include it)
    ''', wiki_body, re.MULTILINE | re.VERBOSE))
```
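To make the effect of the two flags concrete, here is a small sketch on a two-line string that is unrelated to the Wikipedia data. The sample text and variable names are made up for illustration.

```python
import re

text = 'first\nsecond'

# Without re.MULTILINE, ^ and $ anchor to the whole string only
without_flag = re.findall(r'^\w+$', text)
# With re.MULTILINE, ^ and $ also match at the start and end of each line
with_flag = re.findall(r'^\w+$', text, re.MULTILINE)

# With re.VERBOSE, whitespace and # comments inside the pattern are ignored
verbose_pattern = re.compile(r'''
    ^      # start of line
    \w+    # one or more word characters
    $      # end of line
    ''', re.MULTILINE | re.VERBOSE)
verbose_result = verbose_pattern.findall(text)

print(without_flag)
print(with_flag)
print(verbose_result)
```

The first call finds nothing because the string as a whole is not a single word. The other two calls return both lines.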
The main part this time is the following.
```python
print(re.findall(r'''
    ^             # Start of line (the result is the same without it, but I include it)
    \[\[Category: # Search term (\ escapes the brackets)
    (             # Start of capturing group
    .*?           # Non-greedy match of zero or more arbitrary characters
    )             # End of capturing group
    (?:           # Start of non-capturing group
    \|            # Search term '|'
    .*            # Zero or more arbitrary characters
    )?            # End of non-capturing group (appears zero or one time)
    \]\]          # Search term (\ escapes the brackets)
    $             # End of line (the result is the same without it, but I include it)
    ''', wiki_body, re.MULTILINE | re.VERBOSE))
```
If you wrap part of a pattern in [`(?: ...)`](https://qiita.com/FukuharaYohei/items/459f27f0d7bbba551af7#%E3%82%AD%E3%83%A3%E3%83%97%E3%83%81%E3%83%A3%E5%AF%BE%E8%B1%A1%E5%A4%96), that part still takes part in matching, but it **is not captured**, so `re.findall` does not return it.
This time, in the `[[Category:Help|Hayami Hiyo]]` format, I do not want to capture the `|Hayami Hiyo` part, so I put it in a non-capturing group.
In the example below, the `4` is used as part of the regular expression pattern, but it does not appear in the result.
```python
>>> re.findall(r'(.012)(?:4)', 'A0123 B0124 C0123')
['B012']
```
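For comparison, here is a short sketch that runs the same input with an ordinary capturing group and with a non-capturing group. The variable names are only illustrative.

```python
import re

text = 'A0123 B0124 C0123'

# Both groups capture: findall returns one tuple per match
both_captured = re.findall(r'(.012)(4)', text)
# The second group is non-capturing: findall returns only the first group
one_captured = re.findall(r'(.012)(?:4)', text)

# The non-capturing group still takes part in the match itself
m = re.search(r'(.012)(?:4)', text)
whole_match = m.group(0)   # the full matched text, including the 4
first_group = m.group(1)   # only the captured part

print(both_captured)
print(one_captured)
print(whole_match, first_group)
```

In both cases the `4` is required for a match. The only difference is whether it is returned as a group.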
Greedy and non-greedy matching **control the length of the matched string**. A greedy match takes the longest possible match, and a non-greedy match takes the shortest possible match. The default is greedy.
This time, in the `[[Category:Help|Hayami Hiyo]]` format, the `|Hayami Hiyo` part appears zero or one time. If the capturing group is not made non-greedy, the optional part is treated as appearing zero times and the capture swallows it as well, so `Help|Hayami Hiyo` is acquired.
```python
# Greedy match
>>> print(re.findall(r'.0.*2', 'A0123 B0123'))
['A0123 B012']
# Non-greedy match (? after *)
>>> print(re.findall(r'.0.*?2', 'A0123 B0123'))
['A012', 'B012']
```
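The same difference can be checked directly on a category line. This sketch uses a one-line version of the pattern from the program and the sample string from Help:Quick Reference; the variable names are illustrative.

```python
import re

line = '[[Category:Help|Hayami Hiyo]]'

# Greedy capture: (.*) takes as much as it can, so the optional
# non-capturing group ends up matching zero times
greedy = re.findall(r'^\[\[Category:(.*)(?:\|.*)?\]\]$', line)

# Non-greedy capture: (.*?) stops as early as possible, leaving
# the '|...' part to the non-capturing group
non_greedy = re.findall(r'^\[\[Category:(.*?)(?:\|.*)?\]\]$', line)

print(greedy)
print(non_greedy)
```

With the greedy version, the part after `|` leaks into the result. With the non-greedy version, only `Help` is returned.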
When the program is executed, the following results will be output.
Output result
```
['England', 'Commonwealth Kingdom', 'G8 member countries', 'European Union member states', 'Maritime nation', 'Sovereign country', 'Island country', 'States / Regions Established in 1801']
```
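If you do not have the data file at hand, the result can be reproduced from the excerpt shown earlier. The sketch below pastes that excerpt into a string (here named `excerpt`, a name chosen only for this example) and applies a one-line version of the same pattern.

```python
import re

# The category lines from the excerpt, one per line
excerpt = '''[[Category:England|*]]
[[Category:Commonwealth Kingdom|*]]
[[Category:G8 member countries]]
[[Category:European Union member states]]
[[Category:Maritime nation]]
[[Category:Sovereign country]]
[[Category:Island country|Kureito Furiten]]
[[Category:States / Regions Established in 1801]]'''

# Same pattern as the program, written on one line without re.VERBOSE
categories = re.findall(r'^\[\[Category:(.*?)(?:\|.*)?\]\]$', excerpt, re.MULTILINE)
print(categories)
```

This prints the same list as the output above, with the part after `|` dropped from the entries that have one.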