The Language Processing 100 Knock 2020 exercises have been released, so I will try them right away.
In Chapter 3, we use regular expressions to extract and format the necessary information from Wikipedia articles.
Wikipedia markup is documented in "Help: Quick reference table - Wikipedia" and the API in "API: Imageinfo - MediaWiki". However, the markup documentation is incomplete, so you need to identify the actual patterns by inspecting the data or the [Wikipedia page](https://ja.wikipedia.org/wiki/%E3%82%A4%E3%82%AE%E3%83%AA%E3%82%B9) itself.
There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format.
Read the JSON file of the Wikipedia articles and display the article text about the UK. For problems 21-29, run them against the article text extracted here.
code
```python
import gzip
import json

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]
print(eng_data['text'])
```
Output result (part)
{{redirect|UK}}
{{redirect|United Kingdom|The princes of the Spring and Autumn period|English(Spring and autumn)}}
{{Otheruses|European country|Local cuisine of Nagasaki and Kumamoto prefectures|England}}
{{Basic information Country
|Abbreviated name=England
︙
Note that jawiki-country.json is JSON Lines, not plain JSON.
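The distinction matters: calling `json.loads` on the whole file fails, because a JSON Lines file is a sequence of independent JSON objects, one per line. A minimal sketch with a made-up two-article sample compressed in memory:

```python
import gzip
import io
import json

# A made-up two-article sample in JSON Lines format, gzip-compressed in memory
payload = b'{"title": "A", "text": "aaa"}\n{"title": "B", "text": "bbb"}\n'
buf = io.BytesIO(gzip.compress(payload))

# Each line is its own JSON object, so parse line by line
with gzip.open(buf, mode='rt') as f:
    articles = [json.loads(line) for line in f]

print([a['title'] for a in articles])
```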
Extract the line that declares the category name in the article.
code
```python
import gzip
import json
import regex as re

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]

# Split the article text into lines
texts = eng_data['text'].split('\n')

# Extract the lines that contain a category declaration
cat_rows = list(filter(lambda e: re.search(r'\[\[category:|\[\[Category:', e), texts))
print('\n'.join(cat_rows))
```
Output result
[[Category:England|*]]
[[Category:Commonwealth of Nations]]
[[Category:Commonwealth Kingdom|*]]
[[Category:G8 member countries]]
[[Category:European Union member states|Former]]
[[Category:Maritime nation]]
[[Category:Existing sovereign country]]
[[Category:Island country]]
[[Category:A nation / territory established in 1801]]
The "Help: Quick reference table - Wikipedia" page documents the pattern "[[Category:Help|sort key]]", but in practice the lowercase "[[category:...]]" pattern also occurs (and it happens to appear in the UK article).
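One way to cover both spellings without listing each alternative is a case-insensitive pattern; a small sketch using the standard re module on made-up sample lines:

```python
import re

# Made-up sample lines: the documented capitalized form and the lowercase variant
rows = [
    '[[Category:England|*]]',
    '[[category:Help|sortkey]]',
    'plain text line',
]

# re.IGNORECASE matches both 'Category:' and 'category:' with one pattern
pattern = re.compile(r'\[\[category:', re.IGNORECASE)
matched = [row for row in rows if pattern.search(row)]
print(matched)
```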
Extract the article category names (by name, not line by line).
code
```python
import gzip
import json
import regex as re

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]

# Split the article text into lines
texts = eng_data['text'].split('\n')

# Extract the lines that contain a category declaration
cat_rows = list(filter(lambda e: re.search(r'\[\[category:|\[\[Category:', e), texts))

# Extract only the category names from those lines
cat_rows = list(map(lambda e: re.search(r'(?<=(\[\[category:|\[\[Category:)).+?(?=(\||\]))', e).group(), cat_rows))
print('\n'.join(cat_rows))
```
Output result
England
Commonwealth of Nations
Commonwealth Kingdom
G8 member countries
European Union member states
Maritime nation
Existing sovereign country
Island country
A nation / territory established in 1801
Python's standard regular expression library is re, but the variable-width look-behind used here raises a "look-behind requires fixed-width pattern" error, so I use the third-party regex module instead.
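The restriction is easy to reproduce with the standard re module alone; a look-behind whose branches differ in length fails to compile, while rewriting the pattern with a capture group avoids look-behind entirely (the sample string below is made up):

```python
import re

# A look-behind whose branches have different lengths is rejected by re
try:
    re.compile(r'(?<=(\[\[category:|\[\[Cat:)).+')
    fixed_width_error = None
except re.error as err:
    fixed_width_error = str(err)
print(fixed_width_error)  # look-behind requires fixed-width pattern

# Workaround without look-behind: capture the name after the prefix instead
m = re.search(r'\[\[[Cc]ategory:(.+?)(?:\||\])', '[[Category:Island country]]')
print(m.group(1))  # Island country
```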
Display the section name and its level (for example, 1 if "== section name ==") included in the article.
code
```python
import gzip
import json
import re

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]

# Split the article text into lines
texts = eng_data['text'].split('\n')

# Extract the lines that contain a section heading
sec_rows = list(filter(lambda e: re.search('==.+==', e), texts))

# Compute the level from the number of '='
sec_rows_num = list(map(lambda e: e + ':' + str(int(e.count('=') / 2 - 1)), sec_rows))

# Remove the '=' signs and the whitespace
sections = list(map(lambda e: e.replace('=', '').replace(' ', ''), sec_rows_num))
print('\n'.join(sections))
```
Output result (part)
Country name:1
history:1
Geography:1
Major cities:2
climate:2
︙
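Counting every '=' in the line would break if the heading text itself contained an '='; capturing the leading run of '=' and requiring it to repeat at the end is safer. A small sketch on made-up heading lines:

```python
import re

# Made-up heading lines at two levels, plus a non-heading line
rows = ['== History ==', '=== Major cities ===', 'not a heading']

sections = []
for row in rows:
    # Capture the leading run of '='; the level is its length minus one
    m = re.match(r'^(={2,})\s*(.+?)\s*\1$', row)
    if m:
        sections.append(f'{m.group(2)}:{len(m.group(1)) - 1}')
print(sections)  # ['History:1', 'Major cities:2']
```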
Extract all the media files referenced in the article.
code
```python
import gzip
import json
import regex as re

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]

# Split the article text into lines
texts = eng_data['text'].split('\n')

# Extract the lines that contain a file reference
# ('ファイル:' is the Japanese spelling of the 'File:' prefix)
file_rows = list(filter(lambda e: re.search(r'\[\[ファイル:|\[\[File:|\[\[file:', e), texts))

# Extract only the file name from those lines
file_rows = list(map(lambda e: re.search(r'(?<=(\[\[ファイル:|\[\[File:|\[\[file:)).+?(?=(\||\]))', e).group(), file_rows))
print('\n'.join(file_rows))
```
Output result
Royal Coat of Arms of the United Kingdom.svg
United States Navy Band - God Save the Queen.ogg
Descriptio Prime Tabulae Europae.jpg
Lenepveu, Jeanne d'Arc au siège d'Orléans.jpg
London.bankofengland.arp.jpg
Battle of Waterloo 1815.PNG
Uk topo en.jpg
BenNevis2005.jpg
Population density UK 2011 census.png
2019 Greenwich Peninsula & Canary Wharf.jpg
Leeds CBD at night.jpg
Palace of Westminster, London - Feb 2007.jpg
Scotland Parliament Holyrood.jpg
Donald Trump and Theresa May (33998675310) (cropped).jpg
Soldiers Trooping the Colour, 16th June 2007.jpg
City of London skyline from London City Hall - Oct 2008.jpg
Oil platform in the North SeaPros.jpg
Eurostar at St Pancras Jan 2008.jpg
Heathrow Terminal 5C Iwelumo-1.jpg
UKpop.svg
Anglospeak.svg
Royal Aberdeen Children's Hospital.jpg
CHANDOS3.jpg
The Fabs.JPG
Wembley Stadium, illuminated.jpg
The "Help: Quick reference table - Wikipedia" page documents the pattern "[[File:Wikipedia-logo-v2-ja.png|thumb|caption]]", but in practice the "[[ファイル:...]]" and "[[file:...]]" variants also occur.
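Instead of filtering line by line, all file names can also be pulled out of the whole text in one pass; a sketch on a made-up three-line snippet (ファイル: is the Japanese File: prefix):

```python
import re

# Made-up wiki text containing the three prefix variants
text = (
    '[[File:London.jpg|thumb|caption]]\n'
    '[[file:UKpop.svg]]\n'
    '[[ファイル:BenNevis2005.jpg|thumb]]'
)

# Non-capturing prefix alternation; the file name ends at '|' or ']'
files = re.findall(r'\[\[(?:File|file|ファイル):(.+?)(?:\||\])', text)
print(files)
```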
Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.
code
```python
import gzip
import json
import regex as re
import pprint

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]

# Extract the text
text = eng_data['text']

# Extract the basic information template
basic_text = re.search(r'{{Basic information[\s\S]+?}}\n\n', text).group().replace('\n*', '*')

# Split into lines and drop the parts we do not need
basic_ary = basic_text.split('\n')
del basic_ary[0]
del basic_ary[-3:]

# Convert to a dictionary
basic_dict = {}
for basic in basic_ary:
    key, *values = basic.split('=')
    key = key.replace(' ', '').replace('|', '')
    basic_dict[key] = ''.join(values).strip()
pprint.pprint(basic_dict)
```
Output result (part)
{'GDP/Man': '36,727<ref name"imf-statistics-gdp" />',
'GDP value': '2,316.2 billion<ref name"imf-statistics-gdp" />',
'GDP value MER': '2,433.7 billion<ref name"imf-statistics-gdp" />',
︙
'Prime Minister's name': '[[Boris Johnson]]',
'Prime Minister's title': '[[British Prime Minister|Prime Minister]]',
'capital': '[[London]](infact)'}
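One caveat of joining the value parts with `''.join` is that any '=' inside a value (for example in `<ref name="...">`) is silently dropped, which is why the refs above print as `name"imf-statistics-gdp"`. Splitting only at the first '=' keeps the value intact; a sketch on a made-up field line:

```python
# A made-up template field whose value itself contains '='
row = '|GDP value = 2,316.2 billion<ref name="imf-statistics-gdp" />'

# Split only at the first '=' so later '=' stay in the value
key, value = row.split('=', 1)
key = key.replace(' ', '').replace('|', '')
value = value.strip()
print(key)
print(value)  # the '=' inside the ref tag survives
```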
In addition to the processing of problem 25, remove MediaWiki's emphasis markup (all of weak emphasis, emphasis, and strong emphasis) from the template values and convert them to plain text (reference: markup quick reference table).
code
```python
import gzip
import json
import regex as re
import pprint

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]

# Extract the text
text = eng_data['text']

# Extract the basic information template
basic_text = re.search(r'{{Basic information[\s\S]+?}}\n\n', text).group().replace('\n*', '*')

# Split into lines and drop the parts we do not need
basic_ary = basic_text.split('\n')
del basic_ary[0]
del basic_ary[-3:]

# Convert to a dictionary
basic_dict = {}
for basic in basic_ary:
    # Split into key and values
    key, *values = basic.split('=')
    # Clean up the key
    key = key.replace(' ', '').replace('|', '')
    # The values come back as a list, so join them
    value = ''.join(values).strip()
    # Remove the emphasis markup
    value = value.replace("'''''", '').replace("'''", '').replace("''", '')
    basic_dict[key] = value
pprint.pprint(basic_dict)
```
Output result (part)
{'GDP/Man': '36,727<ref name"imf-statistics-gdp" />',
'GDP value': '2,316.2 billion<ref name"imf-statistics-gdp" />',
'GDP value MER': '2,433.7 billion<ref name"imf-statistics-gdp" />',
︙
'Established form 4': 'Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"',
︙
'Prime Minister's name': '[[Boris Johnson]]',
'Prime Minister's title': '[[British Prime Minister|Prime Minister]]',
'capital': '[[London]](infact)'}
In "Help: Quick reference table - Wikipedia" these are called "distinction from others (italics)", "emphasis (bold)", and "italics and emphasis"; here I treat all three as "emphasis markup".
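The three chained replace calls can also be collapsed into a single substitution, since all three emphasis markers are just runs of two to five apostrophes; a sketch on made-up values:

```python
import re

values = ["''italic''", "'''bold'''", "'''''both'''''", "no markup"]

# Runs of 2-5 apostrophes are emphasis markup; single apostrophes stay
cleaned = [re.sub(r"'{2,5}", '', v) for v in values]
print(cleaned)  # ['italic', 'bold', 'both', 'no markup']
```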
In addition to the processing of problem 26, remove MediaWiki's internal link markup from the template values and convert them to plain text (reference: markup quick reference table).
code
```python
import gzip
import json
import regex as re
import pprint

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]

# Extract the text
text = eng_data['text']

# Extract the basic information template
basic_text = re.search(r'{{Basic information[\s\S]+?}}\n\n', text).group().replace('\n*', '*')

# Split into lines and drop the parts we do not need
basic_ary = basic_text.split('\n')
del basic_ary[0]
del basic_ary[-3:]

# Convert to a dictionary
basic_dict = {}
for basic in basic_ary:
    # Split into key and values
    key, *values = basic.split('=')
    # Clean up the key
    key = key.replace(' ', '').replace('|', '')
    # The values come back as a list, so join them
    value = ''.join(values).strip()
    # Remove the emphasis markup
    value = value.replace("'''''", '').replace("'''", '').replace("'", '')
    # Get the template strings ({{...}})
    targets = re.findall(r'((?<=({{)).+?(?=(}})))', value)
    # Replace each template with its display text
    if targets:
        for target in targets:
            value = re.sub(r'{{.+?}}', target[0].split('|')[-1], value, count=1)
    # Get the internal link strings ([[...]])
    targets = re.findall(r'((?<=(\[\[)).+?(?=(\]\])))', value)
    # Replace each internal link with its display text
    if targets:
        for target in targets:
            value = re.sub(r'\[\[.+?\]\]', target[0].split('|')[-1], value, count=1)
    basic_dict[key] = value
pprint.pprint(basic_dict)
```
Output result (part)
{'GDP/Man': '36,727<ref name"imf-statistics-gdp" />',
'GDP value': '2,316.2 billion<ref name"imf-statistics-gdp" />',
'GDP value MER': '2,433.7 billion<ref name"imf-statistics-gdp" />',
︙
'Established form 3': 'United Kingdom of Great Britain and Ireland established<br />(1800 Joint Law)',
︙
'Prime Minister's name': 'Boris Johnson',
'Prime Minister's title': 'Prime Minister',
'capital': 'London (virtually)'}
The "Help: Quick reference table - Wikipedia" page documents patterns like "[[Article title|Display text]]", but in reality there also seems to be a "{{Article title|Display text}}" pattern.
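Both forms can also be rewritten in one pass each with a backreference substitution that keeps only the part after the last '|'; a sketch on made-up values:

```python
import re

def strip_links(value):
    # Keep only the display text (the part after the last '|', if any)
    value = re.sub(r'\[\[(?:[^\[\]|]*\|)*([^\[\]|]*)\]\]', r'\1', value)
    value = re.sub(r'{{(?:[^{}|]*\|)*([^{}|]*)}}', r'\1', value)
    return value

print(strip_links('[[British Prime Minister|Prime Minister]]'))  # Prime Minister
print(strip_links('[[London]](in fact)'))                        # London(in fact)
print(strip_links('{{lang|fr|Dieu et mon droit}}'))              # Dieu et mon droit
```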
In addition to the processing of problem 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.
code
```python
import gzip
import json
import regex as re
import pprint

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]

# Extract the text
text = eng_data['text']

# Extract the basic information template
basic_text = re.search(r'{{Basic information[\s\S]+?}}\n\n', text).group().replace('\n*', '*')

# Split into lines and drop the parts we do not need
basic_ary = basic_text.split('\n')
del basic_ary[0]
del basic_ary[-3:]

# Convert to a dictionary
basic_dict = {}
for basic in basic_ary:
    # Split into key and values
    key, *values = basic.split('=')
    # Clean up the key
    key = key.replace(' ', '').replace('|', '')
    # The values come back as a list, so join them
    value = ''.join(values).strip()
    # Remove the emphasis markup
    value = value.replace("'''''", '').replace("'''", '').replace("'", '')
    # Get the template strings ({{...}})
    targets = re.findall(r'((?<=({{)).+?(?=(}})))', value)
    # Replace each template with its display text
    if targets:
        for target in targets:
            value = re.sub(r'{{.+?}}', target[0].split('|')[-1], value, count=1)
    # Get the internal link strings ([[...]])
    targets = re.findall(r'((?<=(\[\[)).+?(?=(\]\])))', value)
    # Replace each internal link with its display text
    if targets:
        for target in targets:
            value = re.sub(r'\[\[.+?\]\]', target[0].split('|')[-1], value, count=1)
    # Remove tags
    value = value.replace('<br />', '')
    value = re.sub(r'<ref.+?</ref>', '', value)
    value = re.sub(r'<ref.+?/>', '', value)
    basic_dict[key] = value
pprint.pprint(basic_dict)
```
Output result (part)
{'GDP/Man': '36,727',
'GDP value': '2,316.2 billion',
'GDP value MER': '2,433.7 billion',
︙
'Prime Minister's name': 'Boris Johnson',
'Prime Minister's title': 'Prime Minister',
'capital': 'London (virtually)'}
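The two ref substitutions can also be combined into one pattern that matches `<ref` up to either a self-closing `/>` or a closing `</ref>`; a sketch on made-up values:

```python
import re

values = [
    '36,727<ref name="imf-statistics-gdp" />',
    '244,820<ref>Source: United Nations</ref>km<sup>2</sup>',
]

# One pattern covers self-closing refs and ref...[/ref] pairs;
# other tags such as <sup> are left untouched
cleaned = [re.sub(r'<ref[^>]*?/>|<ref.*?</ref>', '', v) for v in values]
print(cleaned)
```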
Use the contents of the template to get the URL of the national flag image. (Hint: Call imageinfo in the MediaWiki API to convert file references to URLs)
code
```python
import gzip
import json
import regex as re
import requests

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]

# Extract the text
text = eng_data['text']

# Extract the basic information template
basic_text = re.search(r'{{Basic information[\s\S]+?}}\n\n', text).group().replace('\n*', '*')

# Split into lines and drop the parts we do not need
basic_ary = basic_text.split('\n')
del basic_ary[0]
del basic_ary[-3:]

# Convert to a dictionary
basic_dict = {}
for basic in basic_ary:
    # Split into key and values
    key, *values = basic.split('=')
    # Clean up the key
    key = key.replace(' ', '').replace('|', '')
    # The values come back as a list, so join them
    value = ''.join(values).strip()
    # Remove the emphasis markup
    value = value.replace("'''''", '').replace("'''", '').replace("'", '')
    # Get the template strings ({{...}})
    targets = re.findall(r'((?<=({{)).+?(?=(}})))', value)
    # Replace each template with its display text
    if targets:
        for target in targets:
            value = re.sub(r'{{.+?}}', target[0].split('|')[-1], value, count=1)
    # Get the internal link strings ([[...]])
    targets = re.findall(r'((?<=(\[\[)).+?(?=(\]\])))', value)
    # Replace each internal link with its display text
    if targets:
        for target in targets:
            value = re.sub(r'\[\[.+?\]\]', target[0].split('|')[-1], value, count=1)
    # Remove tags
    value = value.replace('<br />', '')
    value = re.sub(r'<ref.+?</ref>', '', value)
    value = re.sub(r'<ref.+?/>', '', value)
    basic_dict[key] = value

# API call
session = requests.Session()
params = {
    'action': 'query',
    'format': 'json',
    'prop': 'imageinfo',
    'titles': 'File:' + basic_dict['National flag image'],
    'iiprop': 'url'
}
result = session.get('https://ja.wikipedia.org/w/api.php', params=params)
res_json = result.json()
print(res_json['query']['pages']['-1']['imageinfo'][0]['url'])
```
Output result
https://upload.wikimedia.org/wikipedia/commons/a/ae/Flag_of_the_United_Kingdom.svg
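The pages key ('-1' above) depends on whether the file page exists locally, so hardcoding it is fragile. Taking the first page entry works regardless of the key; a sketch against a made-up response dictionary shaped like the imageinfo output:

```python
# A made-up response shaped like the MediaWiki imageinfo JSON
res_json = {
    'query': {
        'pages': {
            '-1': {
                'title': 'File:Flag of the United Kingdom.svg',
                'imageinfo': [{'url': 'https://upload.wikimedia.org/wikipedia/commons/a/ae/Flag_of_the_United_Kingdom.svg'}],
            }
        }
    }
}

# Take the first (and only) page entry, whatever its key is
page = next(iter(res_json['query']['pages'].values()))
print(page['imageinfo'][0]['url'])
```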