This is a record of knock 25, "Template extraction", from "Chapter 3: Regular expressions" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch3). This time we deal with **positive look-ahead**, which is slightly confusing. Once you understand it, it is simple; the terminology is probably what makes it hard. We also cover **DOTALL and ordered dictionaries**. This content is important because later knocks build on it.

Reference link

Link	Remarks
025.Template extraction.ipynb	Answer program GitHub link
100 amateur language processing knocks:25	Source of many of the copied-and-pasted code parts
Python regular expression basics and tips to learn from scratch	I organized what I learned in this knock
Regular expression HOWTO	Python Official Regular Expression How To
re ---Regular expression operation	Python official re package description
Help:Simplified chart	Quick-reference chart of typical Wikipedia markup
Template:Basic information Country	Wikipedia Country Template

environment

type	version	Contents
OS	Ubuntu18.04.01 LTS	Running in a virtual machine
pyenv	1.2.15	I use pyenv because I sometimes use multiple Python environments
Python	3.6.9	Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed using venv

In the above environment, I am using the following additional Python packages. Just install with regular pip.

type	version
pandas	0.25.3

Chapter 3: Regular Expressions

content of study

By applying regular expressions to the markup description on Wikipedia pages, various information and knowledge can be extracted.

Regular Expressions, JSON, Wikipedia, InfoBox, Web Services

Knock content

There is a file, jawiki-country.json.gz, that exports Wikipedia articles in the following format.

- One article is stored per line in JSON format
- In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format
- The entire file is gzipped
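
As a minimal sketch of that file layout, the snippet below writes a tiny gzipped JSON Lines file and reads it back using only the standard library. The file name `sample-country.json.gz`, the helper `find_text`, and the two sample articles are made up for illustration; the answer program further down uses pandas on the decompressed file instead.

```python
import gzip
import json
import os
import tempfile

# Two made-up articles in the same layout: one JSON object per line.
articles = [
    {"title": "England",
     "text": "{{Basic information Country\n|Abbreviated name=England\n}}\n"},
    {"title": "Japan", "text": "Body of another article"},
]


def find_text(path, title):
    """Return the "text" of the first article whose "title" matches, else None."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            article = json.loads(line)
            if article["title"] == title:
                return article["text"]
    return None


# Write the sample file into a temporary directory (hypothetical file name).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample-country.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for article in articles:
        # json.dumps escapes the newlines inside "text", so each article
        # stays on a single physical line.
        f.write(json.dumps(article, ensure_ascii=False) + "\n")

body = find_text(sample_path, "England")
print(body.splitlines()[0])  # first line of the article body
```

Reading line by line keeps memory use low, since only one article is parsed at a time.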

Create a program that performs the following processing.

25. Template extraction

Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.

Problem supplement (about "basic information")

Wikipedia has a "basic information" template, [Template: Basic Information Country](https://ja.wikipedia.org/wiki/Template:%E5%9F%BA%E7%A4%8E%E6%83%85%E5%A0%B1_%E5%9B%BD), which I referred to. The field names and values in this basic information are extracted with regular expressions.

`Excerpt from the "basic information" part of the file`


{{Basic information Country\n
|Abbreviated name=England\n
|Japanese country name=United Kingdom of Great Britain and Northern Ireland\n

(omitted)

|International call number= 44\n
|Note= <references />\n
}}\n

Answer

Answer program [025. Template extraction.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/03.%E6%AD%A3%E8%A6%8F%E8%A1%A8%E7%8F%BE/025.%E3%83%86%E3%83%B3%E3%83%97%E3%83%AC%E3%83%BC%E3%83%88%E3%81%AE%E6%8A%BD%E5%87%BA.ipynb)

from collections import OrderedDict
from pprint import pprint
import re

import pandas as pd

def extract_by_title(title):
    df_wiki = pd.read_json('jawiki-country.json', lines=True)
    return df_wiki[(df_wiki['title'] == title)]['text'].values[0]

wiki_body = extract_by_title('England')

basic = re.search(r'''
                    ^\{\{Basic\ information.*?\n  # Literal "{{Basic information" (\ escapes the braces and the space), then non-greedily to the end of that line; not captured
                    (.*?)              # Any string, captured non-greedily
                    \}\}               # Literal "}}" (\ escapes the braces)
                    $                  # End of line (re.MULTILINE)
                    ''', wiki_body, re.MULTILINE | re.VERBOSE | re.DOTALL)
pprint(basic.group(1))

# The r prefix makes a raw string, so backslashes are not treated as escape sequences
# Triple quotes let the pattern span several lines
# The re.VERBOSE option makes the pattern ignore whitespace and comments
templates = OrderedDict(re.findall(r'''
                          ^\|         # Literal "|" at the start of a line (\ escapes it), not captured
                          (.+?)       # Capture target (key), non-greedy
                          \s*         # Zero or more whitespace characters
                          =           # Literal "=", not captured
                          \s*         # Zero or more whitespace characters
                          (.+?)       # Capture target (value), non-greedy
                          (?:         # Start of a non-capturing group
                            (?=\n\|)  # Just before a line break (\n) followed by "|" (positive look-ahead)
                          | (?=\n$)   # or just before a line break (\n) followed by the end (positive look-ahead)
                          )           # End of the non-capturing group
                         ''', basic.group(1), re.MULTILINE | re.VERBOSE | re.DOTALL))
pprint(templates)
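
The comments in the program mention re.VERBOSE. As a small hedged sketch of what that option changes, the snippet below compares a compact pattern with a verbose one on a made-up three-line sample (`text` and the variable names are assumptions for illustration). The point to notice is that under re.VERBOSE a literal space in the pattern has to be escaped, otherwise it is silently dropped.

```python
import re

text = "{{Basic information Country\n|Abbreviated name=England\n}}\n"

# Compact form: every character in the pattern is significant.
compact = re.search(r"^\{\{Basic information (.*?)$", text, re.MULTILINE)

# Verbose form: whitespace and "#" comments inside the pattern are ignored,
# so each literal space is written as "\ ".
verbose = re.search(r"""
    ^\{\{Basic\ information\   # literal "{{Basic information " at line start
    (.*?)                      # rest of the line, non-greedy
    $                          # end of line (because of re.MULTILINE)
    """, text, re.MULTILINE | re.VERBOSE)

# Same start of the pattern with the space left unescaped: under VERBOSE the
# space vanishes, so this searches for "{{Basicinformation" and finds nothing.
unescaped = re.search(r"""
    ^\{\{Basic information
    """, text, re.MULTILINE | re.VERBOSE)

print(compact.group(1), verbose.group(1), unescaped)
```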

Answer commentary

"Basic information" extraction

First is the part that extracts the "basic information". I could not get all the way to a dictionary with a single regular expression, so I did it in two steps. The pattern starts at "{{Basic information" at the beginning of a line and runs to the line break that ends that line. The capture group then takes everything after that line break up to `}}`. I use re.DOTALL so that the `.` wildcard also matches line breaks. Until now I used the findall function for regular expressions, but since only one match is needed here, I use the [search function](https://qiita.com/FukuharaYohei/items/459f27f0d7bbba551af7#match%E3%81%A8search).

`python`


basic = re.search(r'''
                    ^\{\{Basic\ information.*?\n  # Literal "{{Basic information" (\ escapes the braces and the space), then non-greedily to the end of that line; not captured
                    (.*?)              # Any string, captured non-greedily
                    \}\}               # Literal "}}" (\ escapes the braces)
                    $                  # End of line (re.MULTILINE)
                    ''', wiki_body, re.MULTILINE | re.VERBOSE | re.DOTALL)
pprint(basic.group(1))

The "basic information" part is extracted in the following form.

`"Basic information" extraction result`


('|Abbreviated name=England\n'
 '|Japanese country name=United Kingdom of Great Britain and Northern Ireland\n'
 '|Official country name= {{lang|en|United Kingdom of Great Britain and Northern '
 'Ireland}}<ref>Official country name other than English:<br/>\n'

(omitted)

 '|ccTLD = [[.uk]] / [[.gb]]<ref>Use is.Overwhelmingly small number compared to uk.</ref>\n'
 '|International call number= 44\n'
 '|Note= <references />\n')
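
To see why re.DOTALL is needed here, the following sketch runs the same kind of pattern with and without the flag. The sample text and the template name `Info` are made up for illustration and are not taken from the actual file.

```python
import re

sample = "{{Info\n|a=1\n|b=2\n}}\nrest of the article\n"

# Without DOTALL, "." stops at a line break, so (.*?) cannot span the
# several lines between "{{Info" and "}}" and the search fails.
no_dotall = re.search(r"^\{\{Info\n(.*?)\}\}$", sample, re.MULTILINE)

# With DOTALL, "." also matches "\n" and the whole body is captured.
with_dotall = re.search(r"^\{\{Info\n(.*?)\}\}$", sample,
                        re.MULTILINE | re.DOTALL)

print(no_dotall)                    # no match object
print(repr(with_dotall.group(1)))   # the two field lines, line breaks included
```

re.MULTILINE is what lets `^` and `$` match at the start and end of each line rather than only at the ends of the whole string.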

Field name and value extraction

The field names and values are extracted here. The result is turned into an ordered dictionary with collections.OrderedDict, because the output is hard to read if the order is not preserved.

`python`


templates = OrderedDict(re.findall(r'''
                          ^\|         # Literal "|" at the start of a line (\ escapes it), not captured
                          (.+?)       # Capture target (key), non-greedy
                          \s*         # Zero or more whitespace characters
                          =           # Literal "=", not captured
                          \s*         # Zero or more whitespace characters
                          (.+?)       # Capture target (value), non-greedy
                          (?:         # Start of a non-capturing group
                            (?=\n\|)  # Just before a line break (\n) followed by "|" (positive look-ahead)
                          | (?=\n$)   # or just before a line break (\n) followed by the end (positive look-ahead)
                          )           # End of the non-capturing group
                         ''', basic.group(1), re.MULTILINE | re.VERBOSE | re.DOTALL))
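
As a rough, self-contained illustration of this step, the sketch below applies an equivalent pattern to a made-up template body (`body` and its three fields are assumptions, not data from the file). With two capture groups, re.findall returns a list of (key, value) tuples, which OrderedDict accepts directly. The second value spans two lines, which is the case the look-ahead handles.

```python
import re
from collections import OrderedDict

# Made-up template body; the second value continues on the next line.
body = "|name = England\n|langs = English<br/>\n*Welsh\n|code= 44\n"

pairs = re.findall(r"""
    ^\|            # "|" at the start of a line
    (.+?)          # key, non-greedy
    \s*=\s*        # "=" with optional whitespace on both sides
    (.+?)          # value, non-greedy; may span lines because of DOTALL
    (?=\n\||\n$)   # stop just before "\n|" or before the final "\n"
    """, body, re.MULTILINE | re.VERBOSE | re.DOTALL)

fields = OrderedDict(pairs)
print(pairs)
print(list(fields))  # keys in the order they appeared
```

Because the look-ahead does not consume the line break and the `|`, the next field still starts at the beginning of its own line and can be matched by `^\|`.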

[Positive look-ahead](https://qiita.com/FukuharaYohei/items/459f27f0d7bbba551af7#%E5%85%88%E8%AA%AD%E3%81%BF%E5%BE%8C%E8%AA%AD%E3%81%BF%E3%82%A2%E3%82%B5%E3%83%BC%E3%82%B7%E3%83%A7%E3%83%B3)

Positive look-ahead is a technique that reads ahead into the following string and succeeds only if that string matches the condition, without including it in the match. I know it is hard to understand even as I write it. To begin with, there are four assertions in this family.

- Positive Lookahead Assertions
- Negative Lookahead Assertions
- Positive Lookbehind Assertions
- Negative Lookbehind Assertions

Arranged as a matrix, they look like this.

	Positive	Negative
Look-ahead	`(?=...)` Matches if `...` comes next	`(?!...)` Matches if `...` does not come next
Look-behind	`(?<=...)` Matches if `...` matches immediately before the current position	`(?<!...)` Matches if `...` does not match immediately before the current position

A concrete example is easier to understand than a detailed explanation.

>>> string = 'A01234 B91235 C01234'

# Positive look-ahead assertion (Positive Lookahead Assertions)
# Strings where '123' is followed by '5' (the '5' checked by '(?=5)' is captured only because of the following '.')
>>> print(re.findall(r'..123(?=5).', string))
['B91235']

# Negative look-ahead assertion (Negative Lookahead Assertions)
# Strings where '123' is not followed by '5' (the character checked by '(?!5)' is captured only because of the following '.')
>>> print(re.findall(r'..123(?!5).', string))
['A01234', 'C01234']

# Positive look-behind assertion (Positive Lookbehind Assertions)
# Strings where '0' comes right before '123' (the '0' checked by '(?<=0)' is captured only because of the leading '..')
>>> print(re.findall(r'..(?<=0)123', string))
['A0123', 'C0123']

# Negative look-behind assertion (Negative Lookbehind Assertions)
# Strings where '0' does not come right before '123' (the character checked by '(?<!0)' is captured only because of the leading '..')
>>> print(re.findall(r'..(?<!0)123', string))
['B9123']
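
One more small sketch, using the same sample string, to make the "checked but not captured" point concrete: a look-ahead has zero width, so the match ends before the character it inspects. The variable names `m` and `n` are only for illustration.

```python
import re

string = 'A01234 B91235 C01234'

# The look-ahead only checks that '5' follows; '5' is not part of the match.
m = re.search(r'123(?=5)', string)
print(m.group(), m.span())

# Writing the '5' as an ordinary character makes it part of the match.
n = re.search(r'1235', string)
print(n.group(), n.span())

# Because the look-ahead consumes nothing, the same '5' is still available
# to whatever comes next in the pattern (here, \d).
print(re.findall(r'123(?=5)\d', string))
```

The first match is one character shorter than the second, even though both require the same four characters to be present.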

Output result (execution result)

When the program is executed, the following result is output at the end.

`Output result`


OrderedDict([('Abbreviated name', 'England'),
             ('Japanese country name', 'United Kingdom of Great Britain and Northern Ireland'),
             ('Official country name',
              '{{lang|en|United Kingdom of Great Britain and Northern '
              'Ireland}}<ref>Official country name other than English:<br/>\n'
              '*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn '
              'mu Thuath}}（[[Scottish Gaelic]]）<br/>\n'
              '*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd '
              'Iwerddon}}（[[Welsh]]）<br/>\n'
              '*{{lang|ga|Ríocht Aontaithe na Breataine Móire agus Tuaisceart '
              'na hÉireann}}（[[Irish]]）<br/>\n'
              '*{{lang|kw|An Rywvaneth Unys a Vreten Veur hag Iwerdhon '
              'Glédh}}（[[Cornish]]）<br/>\n'
              '*{{lang|sco|Unitit Kinrick o Great Breetain an Northren '
              'Ireland}}（[[Scots]]）<br/>\n'
              '**{{lang|sco|Claught Kängrick o Docht Brätain an Norlin '
              'Airlann}}、{{lang|sco|Unitet Kängdom o Great Brittain an Norlin '
              'Airlann}}(Ulster Scots)</ref>'),
             ('National flag image', 'Flag of the United Kingdom.svg'),
             ('National emblem image',
              '[[File:Royal Coat of Arms of the United '
              'Kingdom.svg|85px|British coat of arms]]'),
             ('National emblem link', '（[[British coat of arms|National emblem]]）'),
             ('Motto', '{{lang|fr|Dieu et mon droit}}<br/>（[[French]]:God and my rights)'),
             ('National anthem', '[[Her Majesty the Queen|God Save the Queen]]'),
             ('Position image', 'Location_UK_EU_Europe_001.svg'),
             ('Official terminology', '[[English]](infact)'),
             ('capital', '[[London]]'),
             ('Largest city', 'London'),
             ('Head of state title', '[[British monarch|Queen]]'),
             ('Name of head of state', '[[Elizabeth II]]'),
             ("Prime Minister's title", '[[British Prime Minister|Prime Minister]]'),
             ("Prime Minister's name", '[[David Cameron]]'),
             ('Area ranking', '76'),
             ('Area size', '1 E11'),
             ('Area value', '244,820'),
             ('Water area ratio', '1.3%'),
             ('Demographic year', '2011'),
             ('Population ranking', '22'),
             ('Population size', '1 E7'),
             ('Population value',
              '63,181,775<ref>[http://esa.un.org/unpd/wpp/Excel-Data/population.htm '
              'United Nations Department of Economic and Social '
              'Affairs>Population Division>Data>Population>Total '
              'Population]</ref>'),
             ('Population density value', '246'),
             ('GDP statistics year yuan', '2012'),
             ('GDP value source',
              '1,547.8 billion<ref '
              'name="imf-statistics-gdp">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a= '
              'IMF>Data and Statistics>World Economic Outlook Databases>By '
              'Countrise>United Kingdom]</ref>'),
             ('GDP Statistics Year MER', '2012'),
             ('GDP ranking MER', '5'),
             ('GDP value MER', '2,433.7 billion<ref name="imf-statistics-gdp" />'),
             ('GDP statistical year', '2012'),
             ('GDP ranking', '6'),
             ('GDP value', '2,316.2 billion<ref name="imf-statistics-gdp" />'),
             ('GDP/Man', '36,727<ref name="imf-statistics-gdp" />'),
             ('Founding form', 'Founding of the country'),
             ('Established form 1',
              '[[Kingdom of England]]／[[Kingdom of Scotland]]<br />(Both countries until the [[Acts of Union'
              '(1707)|Acts of Union 1707]])'),
             ('Date of establishment 1', '[[927]]／[[843]]'),
             ('Established form 2', '[[Kingdom of Great Britain]]Founding of the country<br />（[[Acts of Union(1707)|Acts of Union 1707]]）'),
             ('Date of establishment 2', '[[1707]]'),
             ('Established form 3',
              '[[United Kingdom of Great Britain and Ireland]]Founding of the country<br />（[[Acts of Union(1800)|Acts of Union 1800]]）'),
             ('Date of establishment 3', '[[1801]]'),
             ('Established form 4', 'Changed to the current country name "\'\'\'United Kingdom of Great Britain and Northern Ireland\'\'\'"'),
             ('Date of establishment 4', '[[1927]]'),
             ('currency', '[[Sterling pound|UK pounds]](&pound;)'),
             ('Currency code', 'GBP'),
             ('Time zone', '±0'),
             ('Daylight saving time', '+1'),
             ('ISO 3166-1', 'GB / GBR'),
             ('ccTLD', '[[.uk]] / [[.gb]]<ref>Use is.Overwhelmingly small number compared to uk.</ref>'),
             ('International call number', '44'),
             ('Note', '<references />')])
