This is a record of knock 26, "Removal of emphasized markup", from "Chapter 3: Regular expressions" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch3). From this knock through the 28th, we remove markup with regular expressions. This time we learn removal (replacement) **using the sub function**.

Reference link

Link	Remarks
026.Removal of highlighted markup.ipynb	GitHub link to the answer program
100 amateur language processing knocks:26	Source from which many parts of the code were copied
Python regular expression basics and tips to learn from scratch	Where I organized what I learned in this knock
Regular expression HOWTO	Official Python regular expression HOWTO
re ---Regular expression operation	Official Python description of the re package
Help:Simplified chart	Quick reference chart of typical Wikipedia markup

Environment

Type	Version	Contents
OS	Ubuntu18.04.01 LTS	Running on a virtual machine
pyenv	1.2.15	I use pyenv because I sometimes use multiple Python environments
Python	3.6.9	I use Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv

In the above environment, I am using the following additional Python packages. Just install with regular pip.

type	version
pandas	0.25.3

Chapter 3: Regular Expressions

Content of study

By applying regular expressions to the markup description on Wikipedia pages, various information and knowledge can be extracted.

Regular Expressions, JSON, Wikipedia, InfoBox, Web Services

Knock content

There is a file, jawiki-country.json.gz, that exports Wikipedia articles in the following format.

- Each line stores the information for one article in JSON format
- On each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format
- The entire file is gzipped
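The file layout described above can also be read with the standard library alone. The following is a minimal sketch, not the answer program's approach (which uses pandas): the helper name `iter_articles` and the two-line sample file are made up for illustration, and the sample stands in for the real dump.

```python
import gzip
import json
import os
import tempfile


def iter_articles(path):
    """Yield one dict per line from a gzipped JSON Lines file."""
    with gzip.open(path, mode='rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)


# Build a tiny stand-in file with the same layout as jawiki-country.json.gz
# (hypothetical sample data, not taken from the real dump).
sample = [
    {'title': 'A', 'text': "'''A''' is a country."},
    {'title': 'B', 'text': 'B is another country.'},
]
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.json.gz')
with gzip.open(sample_path, mode='wt', encoding='utf-8') as f:
    for article in sample:
        f.write(json.dumps(article, ensure_ascii=False) + '\n')

# Map each title to its body text, as the knocks in this chapter need.
articles = {a['title']: a['text'] for a in iter_articles(sample_path)}
print(articles['A'])
```

Reading line by line keeps memory use low, which may matter less for this small dump than for larger ones.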

Create a program that performs the following processing.

26. Removal of emphasized markup

During the processing of knock 25, remove the MediaWiki emphasis markup (all of weak emphasis, emphasis, and strong emphasis) from the template values and convert them to text (Reference: [Markup Quick Reference](http://ja.wikipedia.org/wiki/Help:%E6%97%A9%E8%A6%8B%E8%A1%A8)).

Problem supplement (about "emphasized markup")

According to Help: Quick Reference, there are three types of "emphasized markup", as follows.

type	Format	Example
Distinguishing from others (italics)	Surround with two single quotes	''Distinguish from others''
Emphasis (bold)	Surround with 3 single quotes	'''Emphasis'''
Italics and emphasis	Surround with 5 single quotes	'''''Italics and emphasis'''''
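The three forms in the table differ only in the number of single quotes, so one pattern can tell them apart. This is a small sketch under that assumption; the names `EMPHASIS`, `KINDS`, and `find_emphasis` are hypothetical and do not appear in the answer program.

```python
import re

# Opening run of 2-5 single quotes, shortest possible body, same run to close.
EMPHASIS = re.compile(r"('{2,5})(.+?)\1")
KINDS = {2: 'italics', 3: 'bold', 5: 'bold italics'}


def find_emphasis(text):
    """Return (kind, inner text) pairs for each emphasized span in text."""
    return [(KINDS.get(len(m.group(1)), 'other'), m.group(2))
            for m in EMPHASIS.finditer(text)]


sample = "''Distinguish from others'' / '''Emphasis''' / '''''Italics and emphasis'''''"
found = find_emphasis(sample)
print(found)
```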

I extracted the relevant part of the file with a regular expression, shown below. It seems that the only target this time is "emphasis (bold)".

`Excerpt from the "emphasized markup" part of the file`


"|Established form 4=Current country name "'''United Kingdom of Great Britain and Northern Ireland'''"change to\n

Answer

Answer program: 026.Removal of highlighted markup.ipynb

from collections import OrderedDict
from pprint import pprint
import re

import pandas as pd

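def extract_by_title(title):
    # Assumes jawiki-country.json.gz has already been decompressed
    df_wiki = pd.read_json('jawiki-country.json', lines=True)
    return df_wiki[(df_wiki['title'] == title)]['text'].values[0]

# Note: 'England' is the translated article title; the Japanese dump presumably stores the Japanese title
wiki_body = extract_by_title('England')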

basic = re.search(r'''
                    ^\{\{Basic information.*?\n  # Literal "{{Basic information" (\ escapes the braces), then the rest of that line (not captured, non-greedy)
                    (.*?)              # Template body (captured, non-greedy)
                    \}\}               # Literal "}}" (\ escapes the braces)
                    $                  # End of line (re.MULTILINE)
                    ''', wiki_body, re.MULTILINE | re.VERBOSE | re.DOTALL)

templates = OrderedDict(re.findall(r'''
                          ^\|         # Literal '|' at the start of a line (\ escapes it), not captured
                          (.+?)       # Capture target (key), non-greedy
                          \s*         # 0 or more whitespace characters
                          =           # Literal '=', not captured
                          \s*         # 0 or more whitespace characters
                          (.+?)       # Capture target (value), non-greedy
                          (?:         # Start of non-capturing group
                            (?=\n\|)  # Followed by a newline and '|' (positive lookahead)
                          | (?=\n$)   # Or followed by a newline and the end (positive lookahead)
                          )           # End of non-capturing group
                         ''', basic.group(1), re.MULTILINE | re.VERBOSE | re.DOTALL))

# Markup removal
def remove_markup(string):

    # Remove emphasis markup
    replaced = re.sub(r'''
                       (\'{2,5})   # 2 to 5 single quotes (start of markup)
                       (.*?)       # Zero or more characters, non-greedy (the text to keep)
                       (\1)        # Same quote run as the first capture (end of markup)
                       ''', r'\2', string, flags=re.MULTILINE | re.VERBOSE)
    return replaced

for i, (key, value) in enumerate(templates.items()):
    replaced = remove_markup(value)
    templates[key] = replaced

    # Show only the values that changed
    if value != replaced:
        print(i, key)
        print('Before change\t', value)
        print('After change\t', replaced)
        print('----')

pprint(templates)
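To try the field pattern without downloading the dump, it can be run against a small inline string. This is only a sketch: `sample_body` is a made-up miniature of a template body, `FIELD` is a hypothetical name, and the two lookaheads of the answer program are merged into one alternation that should behave the same way.

```python
from collections import OrderedDict
import re

# Hypothetical miniature of a template body (not real Wikipedia data).
sample_body = "|name = '''Example'''\n|note = line one\nline two\n|code = EX\n"

FIELD = re.compile(r'''
    ^\|            # a field starts with a pipe at the beginning of a line
    (.+?)          # key (non-greedy)
    \s*=\s*        # equals sign with optional surrounding whitespace
    (.+?)          # value (non-greedy, may span lines because of DOTALL)
    (?=\n\||\n$)   # stop before the next field or before the final newline
    ''', re.MULTILINE | re.VERBOSE | re.DOTALL)

fields = OrderedDict(FIELD.findall(sample_body))

# Strip emphasis markup from every value, keeping only the inner text.
cleaned = OrderedDict(
    (key, re.sub(r"('{2,5})(.*?)\1", r'\2', value))
    for key, value in fields.items()
)
print(cleaned)
```

The `note` field shows why re.DOTALL is needed: its value continues onto a second line before the next `|`.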

Answer commentary

The core of this knock is the part below. The sub function is used to remove (replace) the "emphasized markup". Because the fourth positional argument of sub is count, flags (the compile flags) must be passed by keyword. I kept passing the compile flags positionally without noticing count, it did not work, and I wasted about 30 minutes...

`python`


replaced = re.sub(r'''
                   (\'{2,5})   # 2 to 5 single quotes (start of markup)
                   (.*?)       # Zero or more characters, non-greedy (the text to keep)
                   (\1)        # Same quote run as the first capture (end of markup)
                   ''', r'\2', string, flags=re.MULTILINE | re.VERBOSE)
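The count pitfall is easy to reproduce on a toy string. The sketch below uses re.IGNORECASE instead of the flags in the answer program, purely because the effect is easier to see; the variable names are made up for illustration.

```python
import re
import warnings

text = 'Apple apple APPLE'

with warnings.catch_warnings():
    # Newer Python versions warn when count is passed positionally.
    warnings.simplefilter('ignore', DeprecationWarning)
    # re.IGNORECASE lands in the 4th positional slot, which is count, not flags,
    # so the match stays case-sensitive and count is set to the flag's integer value.
    wrong = re.sub(r'apple', 'X', text, re.IGNORECASE)

# Passing the flag by keyword applies it as intended.
right = re.sub(r'apple', 'X', text, flags=re.IGNORECASE)

print(wrong)  # Apple X APPLE
print(right)  # X X X
```

No exception is raised in the wrong case because flag values are integers, which is why the mistake is hard to spot.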

The sub function performs string substitution. Its arguments are, in order: 1. the regular expression pattern, 2. the replacement string, 3. the string to search.

>>> re.sub(r'Replacement target', 'Replaced', 'Replacement target Not applicable Replacement target')
'Replaced Not applicable Replaced'
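The replacement argument does not have to be a fixed string. In the answer program it is r'\2', a backreference to the second capture group, and re.sub also accepts a callable. The following sketch illustrates both on a made-up sample string; the `<n>` tag format is invented only to make the callable's effect visible.

```python
import re

pattern = re.compile(r"('{2,5})(.*?)\1")

sample = "Current name '''United Kingdom''' (''UK'')"

# r'\2' keeps only the second group (the text between the quote runs).
stripped = pattern.sub(r'\2', sample)

# A callable receives each match object; here it appends the number of quotes removed.
tagged = pattern.sub(lambda m: f'{m.group(2)}<{len(m.group(1))}>', sample)

print(stripped)  # Current name United Kingdom (UK)
print(tagged)    # Current name United Kingdom<3> (UK<2>)
```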

By the way, the value of "Established form 4" before and after the change is as follows.

41 Established form 4
Before change Current country name "'''United Kingdom of Great Britain and Northern Ireland'''"change to
After change Current country name "United Kingdom of Great Britain and Northern Ireland"change to
----

Output result (execution result)

Running the program outputs the following result. The remaining markup will be removed in knocks 27 and 28.

`Output result`


OrderedDict([('Abbreviated name', 'England'),
             ('Japanese country name', 'United Kingdom of Great Britain and Northern Ireland'),
             ('Official country name',
              '{{lang|en|United Kingdom of Great Britain and Northern '
              'Ireland}}<ref>Official country name other than English:<br/>\n'
              '*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn '
              'mu Thuath}}（[[Scottish Gaelic]]）<br/>\n'
              '*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd '
              'Iwerddon}}（[[Welsh]]）<br/>\n'
              '*{{lang|ga|Ríocht Aontaithe na Breataine Móire agus Tuaisceart '
              'na hÉireann}}（[[Irish]]）<br/>\n'
              '*{{lang|kw|An Rywvaneth Unys a Vreten Veur hag Iwerdhon '
              'Glédh}}（[[Cornish]]）<br/>\n'
              '*{{lang|sco|Unitit Kinrick o Great Breetain an Northren '
              'Ireland}}（[[Scots]]）<br/>\n'
              '**{{lang|sco|Claught Kängrick o Docht Brätain an Norlin '
              'Airlann}}、{{lang|sco|Unitet Kängdom o Great Brittain an Norlin '
              'Airlann}}(Ulster Scots)</ref>'),
             ('National flag image', 'Flag of the United Kingdom.svg'),
             ('National emblem image',
              '[[File:Royal Coat of Arms of the United '
              'Kingdom.svg|85px|British coat of arms]]'),
             ('National emblem link', '（[[British coat of arms|National emblem]]）'),
             ('Motto', '{{lang|fr|Dieu et mon droit}}<br/>（[[French]]:God and my rights)'),
             ('National anthem', '[[Her Majesty the Queen|God Save the Queen]]'),
             ('Position image', 'Location_UK_EU_Europe_001.svg'),
             ('Official terminology', '[[English]](de facto)'),
             ('capital', '[[London]]'),
             ('Largest city', 'London'),
             ('Head of state title', '[[British monarch|Queen]]'),
             ('Name of head of state', '[[Elizabeth II]]'),
             ("Prime Minister's title", '[[British Prime Minister|Prime Minister]]'),
             ("Prime Minister's name", '[[David Cameron]]'),
             ('Area ranking', '76'),
             ('Area size', '1 E11'),
             ('Area value', '244,820'),
             ('Water area ratio', '1.3%'),
             ('Demographic year', '2011'),
             ('Population ranking', '22'),
             ('Population size', '1 E7'),
             ('Population value',
              '63,181,775<ref>[http://esa.un.org/unpd/wpp/Excel-Data/population.htm '
              'United Nations Department of Economic and Social '
              'Affairs>Population Division>Data>Population>Total '
              'Population]</ref>'),
             ('Population density value', '246'),
             ('GDP statistics year yuan', '2012'),
             ('GDP value source',
              '1,547.8 billion<ref '
              'name="imf-statistics-gdp">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a= '
              'IMF>Data and Statistics>World Economic Outlook Databases>By '
              'Countrise>United Kingdom]</ref>'),
             ('GDP Statistics Year MER', '2012'),
             ('GDP ranking MER', '5'),
             ('GDP value MER', '2,433.7 billion<ref name="imf-statistics-gdp" />'),
             ('GDP statistical year', '2012'),
             ('GDP ranking', '6'),
             ('GDP value', '2,316.2 billion<ref name="imf-statistics-gdp" />'),
             ('GDP/Man', '36,727<ref name="imf-statistics-gdp" />'),
             ('Founding form', 'Founding of the country'),
             ('Established form 1',
              '[[Kingdom of England]]／[[Kingdom of Scotland]]<br />(Both countries[[Acts of Union'
              '(1707)|Acts of Union 1707]]Until)'),
             ('Date of establishment 1', '[[927]]／[[843]]'),
             ('Established form 2', '[[Kingdom of Great Britain]]Founding of the country<br />（[[Acts of Union(1707)|Acts of Union 1707]]）'),
             ('Date of establishment 2', '[[1707]]'),
             ('Established form 3',
              '[[United Kingdom of Great Britain and Ireland]]Founding of the country<br />（[[Acts of Union(1800)|Acts of Union 1800]]）'),
             ('Date of establishment 3', '[[1801]]'),
             ('Established form 4', 'Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"'),
             ('Date of establishment 4', '[[1927]]'),
             ('currency', '[[Sterling pound|UK pounds]](&pound;)'),
             ('Currency code', 'GBP'),
             ('Time zone', '±0'),
             ('Daylight saving time', '+1'),
             ('ISO 3166-1', 'GB / GBR'),
             ('ccTLD', '[[.uk]] / [[.gb]]<ref>Usage is overwhelmingly rare compared to .uk.</ref>'),
             ('International call number', '44'),
             ('Note', '<references />')])

100 Language Processing Knock-26: Removal of emphasized markup