100 Language Processing Knock-28: MediaWiki Markup Removal

This is a record of knock 28, "Removal of MediaWiki markup", from "Chapter 3: Regular expressions" of the 2015 edition of the 100 Language Processing Knocks (.ac.jp/nlp100/#ch3). It is the last of the markup removal knocks. There is nothing new to learn here; it is a knock for putting what you have learned so far into practice.

Reference link

| Link | Remarks |
|:--|:--|
| 028.MediaWiki markup removal.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks:28 | Source from which many parts of the code were copied |
| Python regular expression basics and tips to learn from scratch | Where I organized what I learned in this knock |
| Regular expression HOWTO | Official Python regular expression HOWTO |
| re --- Regular expression operations | Official Python description of the re package |
| Help:Simplified chart | Simplified chart of typical Wikipedia markup |

Environment

| type | version | Contents |
|:--|:--|:--|
| OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | I'm using Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed using venv |

In the above environment, I am using the following additional Python packages. Just install with regular pip.

| type | version |
|:--|:--|
| pandas | 0.25.3 |

Chapter 3: Regular Expressions

Content of study

By applying regular expressions to the markup description on Wikipedia pages, various information and knowledge can be extracted.

Regular Expressions, JSON, Wikipedia, InfoBox, Web Services

Knock content

There is a file, jawiki-country.json.gz, that exports Wikipedia articles in the following format.

- One article is stored per line in JSON format
- In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format
- The entire file is gzipped
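The file format described above (one JSON object per line, gzipped) can be read with the standard library alone. The following is a minimal sketch, not the answer program: it builds a tiny stand-in file first, and the sample titles, the temporary path, and the helper name `read_articles` are all assumptions made for illustration.

```python
import gzip
import json
import os
import tempfile

# Hypothetical sample data standing in for jawiki-country.json.gz
articles = [
    {'title': 'Sample A', 'text': 'Body of A'},
    {'title': 'Sample B', 'text': 'Body of B'},
]

# Write one JSON object per line into a gzipped file
path = os.path.join(tempfile.mkdtemp(), 'sample.json.gz')
with gzip.open(path, 'wt', encoding='utf-8') as f:
    for article in articles:
        f.write(json.dumps(article, ensure_ascii=False) + '\n')


def read_articles(gz_path):
    """Yield one dict per line from a gzipped JSON Lines file."""
    with gzip.open(gz_path, 'rt', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)


# Map each article name ("title") to its body ("text")
titles = {a['title']: a['text'] for a in read_articles(path)}
print(titles['Sample B'])
```

The answer below uses pandas on the unzipped file instead; this version only shows what "one article per line" means in practice.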

Create a program that performs the following processing.

28. Removal of MediaWiki markup

> In addition to the processing of knock 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.

Problem Supplement (About "MediaWiki Markup")

The task says to remove MediaWiki markup "as much as possible"... In this answer, the following kinds of markup found in the file are targeted with regular expressions.

| type | Format | Reference source |
|:--|:--|:--|
| File | `[[File:Wikipedia-logo-v2-ja.png\|thumb\|Explanatory text]]` | Help:Simplified chart |
| External link | `[http://www.example.org]`<br>`[http://www.example.org display character]` | Help:Simplified chart |
| Template:Lang | `{{lang\|Language tag\|String}}` | Template:Lang |
| HTML tags | `<tag>` | None |
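One way to check which of the markup types listed above actually occur in a piece of text is to count rough matches for each. The sketch below is only an illustration: the sample string is made up from the formats listed above, and the detection patterns are simplified assumptions, not the removal patterns used in the answer.

```python
import re

# Hypothetical sample text built from the formats listed above
sample = (
    "[[File:Wikipedia-logo-v2-ja.png|thumb|Explanatory text]] "
    "[http://www.example.org] [http://www.example.org display character] "
    "{{lang|en|String}} 63,181,775<ref>note</ref><br/>"
)

# Rough detection patterns (assumptions for this sketch only)
patterns = {
    'File': r'\[\[File:[^\]]*\]\]',
    'External link': r'\[https?://[^\]]*\]',
    'Template:Lang': r'\{\{lang\|[^}]*\}\}',
    'HTML tag': r'<[^>]+>',
}

# Count how many times each kind of markup appears
counts = {name: len(re.findall(p, sample)) for name, p in patterns.items()}
for name, n in counts.items():
    print(name, n)
```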

Answer

Answer program: [028.MediaWiki Markup Removal.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/03.%E6%AD%A3%E8%A6%8F%E8%A1%A8%E7%8F%BE/028.MediaWiki%E3%83%9E%E3%83%BC%E3%82%AF%E3%82%A2%E3%83%83%E3%83%97%E3%81%AE%E9%99%A4%E5%8E%BB.ipynb)

from collections import OrderedDict
from pprint import pprint
import re

import pandas as pd

def extract_by_title(title):
    df_wiki = pd.read_json('jawiki-country.json', lines=True)
    return df_wiki[(df_wiki['title'] == title)]['text'].values[0]

# Article titles in jawiki are Japanese; 'イギリス' is the United Kingdom article
wiki_body = extract_by_title('イギリス')

basic = re.search(r'''
                    ^\{\{基礎情報.*?\n  # Search term '{{基礎情報' (basic information; \ is an escape), non-greedy
                    (.*?)              # Arbitrary string (capture target)
                    \}\}               # Search term '}}' (\ is an escape)
                    $                  # End of line (MULTILINE)
                    ''', wiki_body, re.MULTILINE+re.VERBOSE+re.DOTALL)

templates = OrderedDict(re.findall(r'''
                          ^\|         # '|' at the start of a line (\ is an escape), not captured
                          (.+?)       # Capture target (key), non-greedy
                          \s*         # Zero or more whitespace characters
                          =           # Search term, not captured
                          \s*         # Zero or more whitespace characters
                          (.+?)       # Capture target (value), non-greedy
                          (?:         # Start of non-capturing group
                            (?=\n\|)  # Before newline (\n) + '|' (positive lookahead)
                          | (?=\n$)   # Or before newline (\n) + end of line (positive lookahead)
                          )           # End of non-capturing group
                         ''', basic.group(1), re.MULTILINE+re.VERBOSE+re.DOTALL))

# Markup removal
def remove_markup(string):

    # Removal of emphasis markup
    # Removal targets: ''Distinguish from others'', '''Emphasis''', '''''Italic and emphasis'''''
    replaced = re.sub(r'''
                       (\'{2,5})   # 2 to 5 ' characters (start of markup)
                       (.*?)       # Zero or more arbitrary characters (target string), non-greedy
                       (\1)        # Same as the first capture (end of markup)
                       ''', r'\2', string, flags=re.MULTILINE+re.VERBOSE)

    # Removal of internal links and files
    # Removal targets: [[Article title]], [[Article title|Display character]], [[Article title#Section name|Display character]], [[File:Wi.png|thumb|Explanatory text]]
    replaced = re.sub(r'''
        \[\[             # '[[' (start of markup)
        (?:              # Start of non-capturing group
            [^|]*?       # Zero or more characters other than '|', non-greedy
            \|           # '|'
        )*?              # End of group; the group appears zero or more times, non-greedy (changed from No. 27)
        (                # Start of group, capture target
          (?!Category:)  # Negative lookahead (excluded if it starts with Category:)
          ([^|]*?)       # Zero or more characters other than '|', non-greedy (string to be displayed)
        )
        \]\]             # ']]' (end of markup)
        ''', r'\1', replaced, flags=re.MULTILINE+re.VERBOSE)

    # Removal of Template:Lang
    # Removal target: {{lang|Language tag|String}}
    replaced = re.sub(r'''
        \{\{lang    # '{{lang' (start of markup)
        (?:         # Start of non-capturing group
            [^|]*?  # Zero or more characters other than '|', non-greedy
            \|      # '|'
        )*?         # End of group; the group appears zero or more times, non-greedy
        ([^|]*?)    # Capture target: zero or more characters other than '|', non-greedy (string to be displayed)
        \}\}        # '}}' (end of markup)
        ''', r'\1', replaced, flags=re.MULTILINE+re.VERBOSE)
    
    # Removal of external links
    # Removal targets: [http(s)://xxxx], [http(s)://xxx xxx]
    replaced = re.sub(r'''
        \[https?:// # '[http://' or '[https://' (start of markup)
        (?:         # Start of non-capturing group
            [^\s]*? # Zero or more non-whitespace characters, non-greedy
            \s      # Whitespace
        )?          # End of group; the group appears zero or one time
        ([^]]*?)    # Capture target: zero or more characters other than ']', non-greedy (string to be displayed)
        \]          # ']' (end of markup)
        ''', r'\1', replaced, flags=re.MULTILINE+re.VERBOSE)

    # Removal of HTML tags
    # Removal targets: <xx> </xx> <xx/>
    replaced = re.sub(r'''
        <           # '<' (start of markup)
        .+?         # One or more characters, non-greedy
        >           # '>' (end of markup)
        ''', '', replaced, flags=re.MULTILINE+re.VERBOSE)

    return replaced

for i, (key, value) in enumerate(templates.items()):
    replaced = remove_markup(value)
    templates[key] = replaced
    
    # Show the values that changed
    if value != replaced:
        print(i, key)
        print('Before change\t', value)
        print('After change\t', replaced)
        print('----')

pprint(templates)
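The field extraction step in the program above can be tried without the data file. The following sketch applies the same idea to a small hand-written string; the sample text and the variable names `sample` and `fields` are assumptions for illustration and are not taken from the real article.

```python
import re
from collections import OrderedDict

# Hypothetical stand-in for the template body captured by basic.group(1)
sample = (
    "|Abbreviated name = England\n"
    "|capital = [[London]]\n"
    "|Note = line one\nline two\n"
)

fields = OrderedDict(re.findall(r'''
    ^\|           # '|' at the start of a line
    (.+?)         # key, non-greedy
    \s*=\s*       # '=' with optional whitespace on both sides
    (.+?)         # value, non-greedy (DOTALL lets it span several lines)
    (?:
        (?=\n\|)  # stop before a newline followed by '|'
      | (?=\n$)   # or before a newline at the end of a line
    )
    ''', sample, re.MULTILINE | re.VERBOSE | re.DOTALL))

for key, value in fields.items():
    print(repr(key), repr(value))
```

Because of the lookaheads, a value ends only where the next field starts or where the text ends, so the "Note" value keeps both of its lines.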

Answer commentary

"File" removal

Since this is almost the same as the "internal link" removal of the previous knock, I only modified the corresponding part of that regular expression. Specifically, the quantifier at the end of the first group below is changed from )?? to )*? . In file markup such as [[File:Wikipedia-logo-v2-ja.png|thumb|Explanatory text]], the | appears more than once, so the group is now allowed to repeat zero or more times.

replaced = re.sub(r'''
    \[\[             # '[[' (start of markup)
    (?:              # Start of non-capturing group
        [^|]*?       # Zero or more characters other than '|', non-greedy
        \|           # '|'
    )*?              # End of group; the group appears zero or more times, non-greedy (changed from No. 27)
    (                # Start of group, capture target
      (?!Category:)  # Negative lookahead (excluded if it starts with Category:)
      ([^|]*?)       # Zero or more characters other than '|', non-greedy (string to be displayed)
    )
    \]\]             # ']]' (end of markup)
    ''', r'\1', replaced, flags=re.MULTILINE+re.VERBOSE)

Below is the value before and after the file removal.

4 National emblem image
Before change [[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]
After change British coat of arms
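The effect of changing the quantifier from )?? to )*? can be checked on two short strings. This is a sketch under assumptions: the helper name `strip_double_brackets` is made up for illustration, and the pattern is a compact, non-verbose version of the one in the answer.

```python
import re


def strip_double_brackets(text, quantifier):
    """Replace [[...]] with its last '|'-separated part.

    quantifier is applied to the "anything up to a pipe" group:
    '??' (zero or one time, as in knock 27) or
    '*?' (zero or more times, as in knock 28).
    """
    pattern = (r'\[\[(?:[^|]*?\|)' + quantifier +
               r'((?!Category:)[^|]*?)\]\]')
    return re.sub(pattern, r'\1', text)


link = '[[Article title|Display character]]'
file_markup = '[[File:Wi.png|thumb|Explanatory text]]'

for q in ('??', '*?'):
    print(q, '|', strip_double_brackets(link, q),
          '|', strip_double_brackets(file_markup, q))
```

With '??' the internal link is reduced to its display text but the file markup is left untouched, because it contains two | characters. With '*?' both are reduced to their last part.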

"Template:Lang" removal

The {{lang|Language tag|String}} format is replaced with just "String".

# Removal target: {{lang|Language tag|String}}
replaced = re.sub(r'''
    \{\{lang    # '{{lang' (start of markup)
    (?:         # Start of non-capturing group
        [^|]*?  # Zero or more characters other than '|', non-greedy
        \|      # '|'
    )*?         # End of group; the group appears zero or more times, non-greedy
    ([^|]*?)    # Capture target: zero or more characters other than '|', non-greedy (string to be displayed)
    \}\}        # '}}' (end of markup)
    ''', r'\1', replaced, flags=re.MULTILINE+re.VERBOSE)

Below is the result of removing "Template: Lang".

2 Official country name
Before change {{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br/>
*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath}}([[Scottish Gaelic]])<br/>
*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon}}([[Welsh]])<br/>
*{{lang|ga|Ríocht Aontaithe na Breataine Móire agus Tuaisceart na hÉireann}}([[Irish]])<br/>
*{{lang|kw|An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh}}([[Cornish]])<br/>
*{{lang|sco|Unitit Kinrick o Great Breetain an Northren Ireland}}([[Scots]])<br/>
**{{lang|sco|Claught Kängrick o Docht Brätain an Norlin Airlann}}、{{lang|sco|Unitet Kängdom o Great Brittain an Norlin Airlann}}(Ulster Scots)</ref>
After change United Kingdom of Great Britain and Northern Ireland Official country name other than English:
*An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath (Scottish Gaelic)
*Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon (Welsh)
*Ríocht Aontaithe na Breataine Móire agus Tuaisceart na hÉireann (Irish)
*An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh (Cornish)
*Unitit Kinrick o Great Breetain an Northren Ireland (Scots)
**Claught Kängrick o Docht Brätain an Norlin Airlann, Unitet Kängdom o Great Brittain an Norlin Airlann (Ulster Scots)
----
6 slogans
Before change {{lang|fr|Dieu et mon droit}}<br/>([[French]]:God and my rights)
After change Dieu et mon droit (French:God and my rights)
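The Template:Lang replacement can be tried in isolation on short strings. The following is a minimal sketch; the names `LANG_PATTERN` and `remove_lang` are assumptions for illustration, and the pattern is a compact version of the verbose one in the answer.

```python
import re

# Compact form of the pattern: skip every "xxx|" part, keep the last part
LANG_PATTERN = re.compile(r'\{\{lang(?:[^|]*?\|)*?([^|]*?)\}\}')


def remove_lang(text):
    """Replace {{lang|Language tag|String}} with just String."""
    return LANG_PATTERN.sub(r'\1', text)


print(remove_lang('{{lang|fr|Dieu et mon droit}}'))
print(remove_lang('{{lang|sco|A}}、{{lang|sco|B}}'))
```

Only the lang template itself is touched here; other markup in the same value, such as [[...]] links or HTML tags, is left for the other substitutions.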

"External link" removal

External links are removed, including https ones.

# Removal targets: [http(s)://xxxx], [http(s)://xxx xxx]
replaced = re.sub(r'''
    \[https?:// # '[http://' or '[https://' (start of markup)
    (?:         # Start of non-capturing group
        [^\s]*? # Zero or more non-whitespace characters, non-greedy
        \s      # Whitespace
    )?          # End of group; the group appears zero or one time
    ([^]]*?)    # Capture target: zero or more characters other than ']', non-greedy (string to be displayed)
    \]          # ']' (end of markup)
    ''', r'\1', replaced, flags=re.MULTILINE+re.VERBOSE)

The following is the external link removal part.

23 Population value
Before change 63,181,775<ref>[http://esa.un.org/unpd/wpp/Excel-Data/population.htm United Nations Department of Economic and Social Affairs>Population Division>Data>Population>Total Population]</ref>
After change 63,181,775United Nations Department of Economic and Social Affairs>Population Division>Data>Population>Total Population
----
26 GDP source
Before change 1.5478 trillion<ref name="imf-statistics-gdp">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a= IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom]</ref>
After change 1.5478 trillion IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom
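The two external link forms behave slightly differently under this pattern, which a short experiment makes visible. This is a sketch under assumptions: `EXTERNAL_LINK` and `remove_external_link` are names made up for illustration, and the sample strings are shortened stand-ins for the real values.

```python
import re

# Compact form of the external link pattern from the answer
EXTERNAL_LINK = re.compile(r'\[https?://(?:[^\s]*?\s)?([^]]*?)\]')


def remove_external_link(text):
    """Replace [http(s)://url text] with text."""
    return EXTERNAL_LINK.sub(r'\1', text)


with_label = '[http://www.example.org display character]'
url_only = '[https://www.example.org]'

print(remove_external_link(with_label))
print(remove_external_link(url_only))
```

When a display string follows the URL, only that string is kept. When there is only a URL, the optional group cannot find a whitespace character, so the capture takes the rest of the URL and the part after the scheme (here www.example.org) remains.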

"HTML tag" removal

HTML tags are also removed. It is a simple regular expression: anything enclosed between < and > is removed.

# Removal targets: <xx> </xx> <xx/>
replaced = re.sub(r'''
    <           # '<' (start of markup)
    .+?         # One or more characters, non-greedy
    >           # '>' (end of markup)
    ''', '', replaced, flags=re.MULTILINE+re.VERBOSE)

Below are the results. There are quite a lot of them.

2 Official country name
Before change {{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br/>
*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath}}([[Scottish Gaelic]])<br/>
*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon}}([[Welsh]])<br/>
*{{lang|ga|Ríocht Aontaithe na Breataine Móire agus Tuaisceart na hÉireann}}([[Irish]])<br/>
*{{lang|kw|An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh}}([[Cornish]])<br/>
*{{lang|sco|Unitit Kinrick o Great Breetain an Northren Ireland}}([[Scots]])<br/>
**{{lang|sco|Claught Kängrick o Docht Brätain an Norlin Airlann}}、{{lang|sco|Unitet Kängdom o Great Brittain an Norlin Airlann}}(Ulster Scots)</ref>
After change United Kingdom of Great Britain and Northern Ireland Official country name other than English:
*An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath (Scottish Gaelic)
*Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon (Welsh)
*Ríocht Aontaithe na Breataine Móire agus Tuaisceart na hÉireann (Irish)
*An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh (Cornish)
*Unitit Kinrick o Great Breetain an Northren Ireland (Scots)
**Claught Kängrick o Docht Brätain an Norlin Airlann, Unitet Kängdom o Great Brittain an Norlin Airlann (Ulster Scots)
----
6 slogans
Before change {{lang|fr|Dieu et mon droit}}<br/>([[French]]:God and my rights)
After change Dieu et mon droit (French:God and my rights)
----
23 Population value
Before change 63,181,775<ref>[http://esa.un.org/unpd/wpp/Excel-Data/population.htm United Nations Department of Economic and Social Affairs>Population Division>Data>Population>Total Population]</ref>
After change 63,181,775United Nations Department of Economic and Social Affairs>Population Division>Data>Population>Total Population
----
26 GDP source
Before change 1.5478 trillion<ref name="imf-statistics-gdp">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a= IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom]</ref>
After change 1.5478 trillion IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom
----
29 GDP value MER
Before change 2,433.7 billion<ref name="imf-statistics-gdp" />
After change 2,433.7 billion
----
32 GDP value
Before change 2,316.2 billion<ref name="imf-statistics-gdp" />
After change 2,316.2 billion
----
33 GDP/Man
Before change 36,727<ref name="imf-statistics-gdp" />
After change 36,727
----
37 Established form 2
Before change [[Kingdom of Great Britain]]Founding of the country<br />([[Acts of Union(1707)|Acts of Union 1707]])
After change Founding of the Kingdom of Great Britain (Acts of Union 1707)
----
39 Established form 3
Before change [[United Kingdom of Great Britain and Ireland]]Founding of the country<br />([[Acts of Union(1800)|Act of Union 1800]])
After change Founding of the United Kingdom of Great Britain and Ireland (Act of Union 1800)
----
48 ccTLD
Before change [[.uk]] / [[.gb]]<ref>Usage is overwhelmingly rare compared to .uk.</ref>
After change .uk / .gb Usage is overwhelmingly rare compared to .uk.
----
50 Note
Before change <references />
After change
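The "anything between < and >" rule is easy to try on short strings, including one case where it removes more than a tag. The following is a minimal sketch; `remove_tags` is a name made up for illustration and the last sample string is a hypothetical input, not taken from the article.

```python
import re


def remove_tags(text):
    """Remove everything enclosed between '<' and '>' (non-greedy)."""
    return re.sub(r'<.+?>', '', text)


print(remove_tags('36,727<ref name="imf-statistics-gdp" />'))
print(remove_tags('Dieu et mon droit<br/>(French)'))

# Assumed edge case: plain text that merely contains '<' and '>'
print(repr(remove_tags('a < b and c > d')))
```

The first two calls strip the tags as intended. The last call shows the price of the simplicity: text that happens to sit between a literal < and a later > is removed as well, which is acceptable here only because the template values do not contain such text.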

Output result (execution result)

When the program is executed, the following result is output at the end. I feel refreshed.

Output result


OrderedDict([('Abbreviated name', 'England'),
             ('Japanese country name', 'United Kingdom of Great Britain and Northern Ireland'),
             ('Official country name',
              'United Kingdom of Great Britain and Northern '
              'Ireland Official country name in non-English:\n'
              '*An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu '
              'Thuath (Scottish Gaelic)\n'
              '*Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon (Welsh)\n'
              '*Ríocht Aontaithe na Breataine Móire agus Tuaisceart na '
              'hÉireann (Irish)\n'
              '*An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh (Cornish)\n'
              '*Unitit Kinrick o Great Breetain an Northren Ireland (Scots)\n'
              '**Claught Kängrick o Docht Brätain an Norlin Airlann、Unitet '
              'Kängdom o Great Brittain an Norlin Airlann (Ulster Scots)'),
             ('National flag image', 'Flag of the United Kingdom.svg'),
             ('National emblem image', 'British coat of arms'),
             ('National emblem link', '(National emblem)'),
             ('Motto', 'Dieu et mon droit (French:God and my rights)'),
             ('National anthem', 'God Save the Queen'),
             ('Position image', 'Location_UK_EU_Europe_001.svg'),
             ('Official terminology', 'English (virtually)'),
             ('capital', 'London'),
             ('Largest city', 'London'),
             ('Head of state title', 'Queen'),
             ('Name of head of state', 'Elizabeth II'),
             ("Prime Minister's title", 'Prime Minister'),
             ("Prime Minister's name", 'David Cameron'),
             ('Area ranking', '76'),
             ('Area size', '1 E11'),
             ('Area value', '244,820'),
             ('Water area ratio', '1.3%'),
             ('Demographic year', '2011'),
             ('Population ranking', '22'),
             ('Population size', '1 E7'),
             ('Population value',
              '63,181,775United Nations Department of Economic and Social '
              'Affairs>Population Division>Data>Population>Total Population'),
             ('Population density value', '246'),
             ('GDP statistics year yuan', '2012'),
             ('GDP value source',
              '1,547.8 billion IMF>Data and Statistics>World Economic Outlook '
              'Databases>By Countrise>United Kingdom'),
             ('GDP Statistics Year MER', '2012'),
             ('GDP ranking MER', '5'),
             ('GDP value MER', '2,433.7 billion'),
             ('GDP statistical year', '2012'),
             ('GDP ranking', '6'),
             ('GDP value', '2,316.2 billion'),
             ('GDP/Man', '36,727'),
             ('Founding form', 'Founding of the country'),
             ('Established form 1', 'Kingdom of England / Kingdom of Scotland (both until the Act of Union 1707)'),
             ('Date of establishment 1', '927/843'),
             ('Established form 2', 'Founding of the Kingdom of Great Britain (Acts of Union 1707)'),
             ('Date of establishment 2', '1707'),
             ('Established form 3', 'United Kingdom of Great Britain and Ireland founded (Act of Union 1800)'),
             ('Date of establishment 3', '1801'),
             ('Established form 4', 'Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"'),
             ('Date of establishment 4', '1927'),
             ('currency', 'UK pounds(&pound;)'),
             ('Currency code', 'GBP'),
             ('Time zone', '±0'),
             ('Daylight saving time', '+1'),
             ('ISO 3166-1', 'GB / GBR'),
             ('ccTLD', '.uk / .gb Usage is overwhelmingly rare compared to .uk.'),
             ('International call number', '44'),
             ('Note', '')])
