This is a record of knock 21, "Extract lines containing category names", from "Chapter 3: Regular expressions" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch3). The previous knock was preparation; this time we actually practice regular expressions. It makes heavy use of basics that I had so far picked up by googling. Specifically, it is full of basics such as **raw strings, re.VERBOSE, re.MULTILINE, and triple quotes**.
Link | Remarks |
---|---|
021.Extract rows containing category names.ipynb | Answer program GitHub link |
100 amateur language processing knocks:21 | Source from which many parts of the code were copied |
Python regular expression basics and tips to learn from scratch | I organized what I learned in this knock |
Regular expression HOWTO | Python Official Regular Expression How To |
re ---Regular expression operation | Python official re package description |
Help:Quick Reference | Quick reference for typical Wikipedia markup |
type | version | Contents |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | I'm using Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed using venv |
In the above environment, I am using the following additional Python packages. Just install with regular pip.
type | version |
---|---|
pandas | 0.25.3 |
By applying regular expressions to the markup description on Wikipedia pages, various information and knowledge can be extracted.
Regular Expressions, JSON, Wikipedia, InfoBox, Web Services
There is a file, jawiki-country.json.gz, that exports Wikipedia articles in the following format.
- One article is stored per line in JSON format
- In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format
- The entire file is gzipped
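The format described above can be read with the standard library alone. The following is a minimal sketch, not the approach used in this article (which reads an already unzipped file with pandas); the helper names `iter_articles` and `find_text` are hypothetical and introduced only for illustration.

```python
import gzip
import json


def iter_articles(path):
    """Yield one article dict per line of a gzipped JSON Lines file."""
    # mode='rt' decompresses and decodes to text in one step
    with gzip.open(path, mode='rt', encoding='utf-8') as f:
        for line in f:
            if line.strip():  # skip blank lines, if any
                yield json.loads(line)


def find_text(path, title):
    """Return the body of the first article with the given title, or None."""
    for article in iter_articles(path):
        if article.get('title') == title:
            return article.get('text')
    return None
```

Assuming the file is in the working directory, `find_text('jawiki-country.json.gz', title)` would return the body of the article with that title. Reading line by line avoids loading every article into memory at once.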
Create a program that performs the following processing.
Extract the line that declares the category name in the article.
According to Help:Quick Reference, a "category name" is written in the format [[Category:Help|Hayami Hiyo]].
Extract the following part of the file with a regular expression.
Excerpt from the "category name" part of the file
[[Category:England|*]]\n'
[[Category:Commonwealth Kingdom|*]]\n'
[[Category:G8 member countries]]\n'
[[Category:European Union member states]]\n'
[[Category:Maritime nation]]\n'
[[Category:Sovereign country]]\n'
[[Category:Island country|Kureito Furiten]]\n'
[[Category:States / Regions Established in 1801]]'
from pprint import pprint
import re
import pandas as pd
def extract_by_title(title):
    df_wiki = pd.read_json('jawiki-country.json', lines=True)
    return df_wiki[(df_wiki['title'] == title)]['text'].values[0]
wiki_body = extract_by_title('England')
# The r prefix makes a raw string, so escape sequences are not processed
# Triple quotes allow line breaks inside the pattern
# re.VERBOSE ignores whitespace and comments inside the pattern
# re.MULTILINE makes ^ and $ match at the start and end of every line
pprint(re.findall(r'''
    ^                # Start of line (the result is the same without it, but included for clarity)
    (                # Start of group
    .*               # Any string of 0 or more characters
    \[\[Category:    # Search term (\ escapes the brackets)
    .*               # Any string of 0 or more characters
    \]\]             # Search term (\ escapes the brackets)
    .*               # Any string of 0 or more characters
    )                # End of group
    $                # End of line (the result is the same without it, but included for clarity)
    ''', wiki_body, re.MULTILINE | re.VERBOSE))
The main subject of this knock is as follows.
pprint(re.findall(r'''
    ^                # Start of line (the result is the same without it, but included for clarity)
    (                # Start of group
    .*               # Any string of 0 or more characters
    \[\[Category:    # Search term (\ escapes the brackets)
    .*               # Any string of 0 or more characters
    \]\]             # Search term (\ escapes the brackets)
    .*               # Any string of 0 or more characters
    )                # End of group
    $                # End of line (the result is the same without it, but included for clarity)
    ''', wiki_body, re.MULTILINE | re.VERBOSE))
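To try the pattern without downloading the dataset, it can be run against a short hand-written string. This is only a sketch: `sample_body` is made-up text standing in for an article body, and `category_lines` is a name chosen for illustration.

```python
import re

# Made-up stand-in for an article body; the real text comes from the dataset.
sample_body = '''Some introductory text.
== Heading ==
[[Category:Island country|sort key]]
[[Category:Sovereign country]]
Trailing text.'''

pattern = re.compile(r'''
    ^                # start of a line (because of re.MULTILINE)
    (.*              # anything before the category link
    \[\[Category:    # literal opening brackets and "Category:"
    .*               # category name and optional sort key
    \]\]             # literal closing brackets
    .*)              # anything after the category link
    $                # end of a line
    ''', re.MULTILINE | re.VERBOSE)

category_lines = pattern.findall(sample_body)
print(category_lines)
# ['[[Category:Island country|sort key]]', '[[Category:Sovereign country]]']
```

Because `.` does not match a newline by default, each match stays within a single line, which is why whole lines are returned.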
findall function
The findall function **returns all strings that match the pattern as a list**.
The following example extracts all adverbs that end with ly (\w matches "alphanumeric characters and underscores").
findall example
>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly", text)
['carefully', 'quickly']
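One detail matters for the program in this article: when the pattern contains a capture group, findall returns the text captured by the group rather than the whole match. The sketch below illustrates this with the same sentence; the variable names `stems` and `pairs` are only for illustration.

```python
import re

text = "He was carefully disguised but captured quickly by police."

# No group: each element is the whole match.
print(re.findall(r"\w+ly", text))
# ['carefully', 'quickly']

# One group: each element is only the text captured by that group.
stems = re.findall(r"(\w+)ly", text)
print(stems)
# ['careful', 'quick']

# Two or more groups: each element is a tuple of the captured texts.
pairs = re.findall(r"(\w+)(ly)", text)
print(pairs)
# [('careful', 'ly'), ('quick', 'ly')]
```

In the category pattern the single group wraps the entire line, so the result is the same as the whole match.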
Prefix the opening quotation mark with r to make a raw string. In a raw string, Python does not process escape sequences. **Regular expression patterns use many backslashes and become hard to read if each one has to be escaped, so write them as raw strings.**
Raw string print output example
>>> print('a\tb\nA\tB')
a b
A B
>>> print(r'a\tb\nA\tB')
a\tb\nA\tB
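The effect on regular expression patterns can be seen by writing the same pattern both ways. This is a small sketch; the names `normal` and `raw` and the sample sentence are made up for illustration.

```python
import re

# Without the r prefix, every backslash meant for the regex must be doubled.
normal = '\\d+\\.\\d*'
# With the r prefix, the pattern is written exactly as the regex engine sees it.
raw = r'\d+\.\d*'

print(normal == raw)
# True

# '\n' is one newline character; r'\n' is a backslash followed by the letter n.
print(len('\n'), len(r'\n'))
# 1 2

numbers = re.findall(raw, 'pi is 3.14 and e is 2.718')
print(numbers)
# ['3.14', '2.718']
```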
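You can use line breaks in a regular expression pattern by enclosing it in ''' triple quotes (""" also works). **Line breaks make the regular expression pattern easier to read.** Note that without re.VERBOSE the line breaks and spaces become literal characters of the pattern, so triple quotes are normally combined with re.VERBOSE.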
Triple quote usage example
a = re.compile(r'''\d +
\.
\d *''')
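Triple quotes only change how the string is written; whether the line breaks and spaces are part of the pattern depends on the flags. The following sketch compiles the same triple-quoted text with and without re.VERBOSE (the names `multi_line`, `plain`, and `verbose` are made up for illustration).

```python
import re

multi_line = r'''\d +
\.
\d *'''

# Without re.VERBOSE the spaces and newlines are literal parts of the pattern.
plain = re.compile(multi_line)
# With re.VERBOSE they are ignored, leaving the equivalent of \d+\.\d*
verbose = re.compile(multi_line, re.VERBOSE)

print(plain.search('3.14'))
# None
print(verbose.search('3.14').group())
# 3.14

# The plain version only matches text that contains the same spaces and newlines.
print(plain.search('3 \n.\n1') is not None)
# True
```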
re.VERBOSE
By passing re.VERBOSE to the flags parameter, you can use comments and whitespace in the regular expression pattern (it does no harm if the pattern contains neither). **Inserting comments and whitespace makes the regular expression pattern easier to read.** This is a readability technique used in combination with triple quotes.
re.VERBOSE usage example
a = re.compile(r'''\d + # the integral part
\. # the decimal point
\d * # some fractional digits''', re.VERBOSE)
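As a rough check, the verbose pattern above should behave the same as its compact one-line form. The sketch below compares the two on a few made-up sample strings and also shows how to match a literal space or # under re.VERBOSE; the names `compact`, `samples`, and `spaced` are introduced only for illustration.

```python
import re

verbose = re.compile(r'''\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits''', re.VERBOSE)
compact = re.compile(r'\d+\.\d*')

# Both patterns accept and reject the same sample strings.
samples = ['3.14', '10.', 'abc', '.5']
results = [(bool(verbose.fullmatch(s)), bool(compact.fullmatch(s))) for s in samples]
print(results)
# [(True, True), (True, True), (False, False), (False, False)]

# Under re.VERBOSE a literal space must go in a character class (or be
# backslash-escaped), and a literal # must be escaped.
spaced = re.compile(r'''\d+ [ ] \# [ ] \d+   # digits, space, hash, space, digits''',
                    re.VERBOSE)
print(spaced.fullmatch('3 # 4') is not None)
# True
```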
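re.MULTILINE
Use this when you want to search a multi-line string line by line: with re.MULTILINE, ^ and $ match at the start and end of every line rather than only at the start and end of the whole string.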
re.MULTILINE usage example
string = '''\
1st line
2nd line'''
# Every line is a search target
print(re.findall(r'^.+', string, re.MULTILINE))
# ['1st line', '2nd line']
# Only the first line is a search target
print(re.findall(r'^.+', string))
# ['1st line']
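The flag affects $ in the same way as ^. The following sketch uses a made-up three-line string to show both anchors; the variable names are only for illustration.

```python
import re

text = 'alpha one\nbeta two\ngamma three'

# With re.MULTILINE, $ matches at the end of every line.
last_words = re.findall(r'\w+$', text, re.MULTILINE)
print(last_words)
# ['one', 'two', 'three']

# Without it, $ matches only at the end of the whole string.
only_last = re.findall(r'\w+$', text)
print(only_last)
# ['three']

# With re.MULTILINE, ^ matches at the start of every line.
first_words = re.findall(r'^\w+', text, re.MULTILINE)
print(first_words)
# ['alpha', 'beta', 'gamma']
```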
When the program is executed, the following results will be output.
Output result
['[[Category:England|*]]',
'[[Category:Commonwealth Kingdom|*]]',
'[[Category:G8 member countries]]',
'[[Category:European Union member states]]',
'[[Category:Maritime nation]]',
'[[Category:Sovereign country]]',
'[[Category:Island country|Kureito Furiten]]',
'[[Category:States / Regions Established in 1801]]']