The Language Processing 100 Knock 2020 exercises have been released, so I will try them right away.
In Chapter 3, we use regular expressions to extract and format the necessary information from Wikipedia articles.
Wikipedia markup is documented in "Help: Quick reference table - Wikipedia" and the API in "API: Imageinfo - MediaWiki". However, the markup documentation is incomplete, so you need to identify the actual patterns by inspecting the data or the [Wikipedia page](https://ja.wikipedia.org/wiki/%E3%82%A4%E3%82%AE%E3%83%AA%E3%82%B9) itself.
There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format.
Read the JSON file of the Wikipedia articles and display the article text about the UK. For problems 21-29, run them against the article text extracted here.
code
```python
import gzip
import json

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]
print(eng_data['text'])
```
Output result (part)
{{redirect|UK}}
{{redirect|United Kingdom|The princes of the Spring and Autumn period|English(Spring and autumn)}}
{{Otheruses|European country|Local cuisine of Nagasaki and Kumamoto prefectures|England}}
{{Basic information Country
|Abbreviated name=England
︙
Note that jawiki-country.json is JSON Lines, not plain JSON.
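The distinction matters: calling `json.loads` on the whole file fails, because a JSON Lines file is a sequence of independent JSON objects, one per line. A minimal sketch with a made-up two-article sample compressed in memory:

```python
import gzip
import io
import json

# A made-up two-article sample in JSON Lines format, gzip-compressed in memory
payload = b'{"title": "A", "text": "aaa"}\n{"title": "B", "text": "bbb"}\n'
buf = io.BytesIO(gzip.compress(payload))

# Each line is its own JSON object, so parse line by line
with gzip.open(buf, mode='rt') as f:
    articles = [json.loads(line) for line in f]

print([a['title'] for a in articles])
```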
Extract the line that declares the category name in the article.
code
```python
import gzip
import json
import regex as re

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]

# Split the article text into lines
texts = eng_data['text'].split('\n')

# Extract the lines that contain a category declaration
cat_rows = list(filter(lambda e: re.search(r'\[\[category:|\[\[Category:', e), texts))
print('\n'.join(cat_rows))
```
Output result
[[Category:England|*]]
[[Category:Commonwealth of Nations]]
[[Category:Commonwealth Kingdom|*]]
[[Category:G8 member countries]]
[[Category:European Union member states|Former]]
[[Category:Maritime nation]]
[[Category:Existing sovereign country]]
[[Category:Island country]]
[[Category:A nation / territory established in 1801]]
The "Help: Quick reference table - Wikipedia" page documents the pattern "[[Category:Help|sort key]]", but in practice the lowercase "[[category:...]]" pattern also occurs (and it happens to appear in the UK article).
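One way to cover both spellings without listing each alternative is a case-insensitive pattern; a small sketch using the standard re module on made-up sample lines:

```python
import re

# Made-up sample lines: the documented capitalized form and the lowercase variant
rows = [
    '[[Category:England|*]]',
    '[[category:Help|sortkey]]',
    'plain text line',
]

# re.IGNORECASE matches both 'Category:' and 'category:' with one pattern
pattern = re.compile(r'\[\[category:', re.IGNORECASE)
matched = [row for row in rows if pattern.search(row)]
print(matched)
```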
Extract the article category names (by name, not line by line).
code
```python
import gzip
import json
import regex as re

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]

# Split the article text into lines
texts = eng_data['text'].split('\n')

# Extract the lines that contain a category declaration
cat_rows = list(filter(lambda e: re.search(r'\[\[category:|\[\[Category:', e), texts))

# Extract only the category names from those lines
cat_rows = list(map(lambda e: re.search(r'(?<=(\[\[category:|\[\[Category:)).+?(?=(\||\]))', e).group(), cat_rows))
print('\n'.join(cat_rows))
```
Output result
England
Commonwealth of Nations
Commonwealth Kingdom
G8 member countries
European Union member states
Maritime nation
Existing sovereign country
Island country
A nation / territory established in 1801
Python's standard regular expression library is re, but the variable-width look-behind used here raises a "look-behind requires fixed-width pattern" error, so I use the third-party regex module instead.
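The restriction is easy to reproduce with the standard re module alone; a look-behind whose branches differ in length fails to compile, while rewriting the pattern with a capture group avoids look-behind entirely (the sample string below is made up):

```python
import re

# A look-behind whose branches have different lengths is rejected by re
try:
    re.compile(r'(?<=(\[\[category:|\[\[Cat:)).+')
    fixed_width_error = None
except re.error as err:
    fixed_width_error = str(err)
print(fixed_width_error)  # look-behind requires fixed-width pattern

# Workaround without look-behind: capture the name after the prefix instead
m = re.search(r'\[\[[Cc]ategory:(.+?)(?:\||\])', '[[Category:Island country]]')
print(m.group(1))  # Island country
```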
Display the section name and its level (for example, 1 if "== section name ==") included in the article.
code
```python
import gzip
import json
import re

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]

# Split the article text into lines
texts = eng_data['text'].split('\n')

# Extract the lines that contain a section heading
sec_rows = list(filter(lambda e: re.search('==.+==', e), texts))

# Compute the level from the number of '='
sec_rows_num = list(map(lambda e: e + ':' + str(int(e.count('=') / 2 - 1)), sec_rows))

# Remove the '=' signs and the whitespace
sections = list(map(lambda e: e.replace('=', '').replace(' ', ''), sec_rows_num))
print('\n'.join(sections))
```
Output result (part)
Country name:1
history:1
Geography:1
Major cities:2
climate:2
︙
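Counting every '=' in the line would break if the heading text itself contained an '='; capturing the leading run of '=' and requiring it to repeat at the end is safer. A small sketch on made-up heading lines:

```python
import re

# Made-up heading lines at two levels, plus a non-heading line
rows = ['== History ==', '=== Major cities ===', 'not a heading']

sections = []
for row in rows:
    # Capture the leading run of '='; the level is its length minus one
    m = re.match(r'^(={2,})\s*(.+?)\s*\1$', row)
    if m:
        sections.append(f'{m.group(2)}:{len(m.group(1)) - 1}')
print(sections)  # ['History:1', 'Major cities:2']
```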
Extract all the media files referenced in the article.
code
```python
import gzip
import json
import regex as re

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]

# Split the article text into lines
texts = eng_data['text'].split('\n')

# Extract the lines that contain a file reference
# ('ファイル:' is the Japanese spelling of the 'File:' prefix)
file_rows = list(filter(lambda e: re.search(r'\[\[ファイル:|\[\[File:|\[\[file:', e), texts))

# Extract only the file name from those lines
file_rows = list(map(lambda e: re.search(r'(?<=(\[\[ファイル:|\[\[File:|\[\[file:)).+?(?=(\||\]))', e).group(), file_rows))
print('\n'.join(file_rows))
```
Output result
Royal Coat of Arms of the United Kingdom.svg
United States Navy Band - God Save the Queen.ogg
Descriptio Prime Tabulae Europae.jpg
Lenepveu, Jeanne d'Arc au siège d'Orléans.jpg
London.bankofengland.arp.jpg
Battle of Waterloo 1815.PNG
Uk topo en.jpg
BenNevis2005.jpg
Population density UK 2011 census.png
2019 Greenwich Peninsula & Canary Wharf.jpg
Leeds CBD at night.jpg
Palace of Westminster, London - Feb 2007.jpg
Scotland Parliament Holyrood.jpg
Donald Trump and Theresa May (33998675310) (cropped).jpg
Soldiers Trooping the Colour, 16th June 2007.jpg
City of London skyline from London City Hall - Oct 2008.jpg
Oil platform in the North SeaPros.jpg
Eurostar at St Pancras Jan 2008.jpg
Heathrow Terminal 5C Iwelumo-1.jpg
UKpop.svg
Anglospeak.svg
Royal Aberdeen Children's Hospital.jpg
CHANDOS3.jpg
The Fabs.JPG
Wembley Stadium, illuminated.jpg
The "Help: Quick reference table - Wikipedia" page documents the pattern "[[File:Wikipedia-logo-v2-ja.png|thumb|caption]]", but in practice the "[[ファイル:...]]" and "[[file:...]]" variants also occur.
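Instead of filtering line by line, all file names can also be pulled out of the whole text in one pass; a sketch on a made-up three-line snippet (ファイル: is the Japanese File: prefix):

```python
import re

# Made-up wiki text containing the three prefix variants
text = (
    '[[File:London.jpg|thumb|caption]]\n'
    '[[file:UKpop.svg]]\n'
    '[[ファイル:BenNevis2005.jpg|thumb]]'
)

# Non-capturing prefix alternation; the file name ends at '|' or ']'
files = re.findall(r'\[\[(?:File|file|ファイル):(.+?)(?:\||\])', text)
print(files)
```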
Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.
code
```python
import gzip
import json
import regex as re
import pprint

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]

# Extract the text
text = eng_data['text']

# Extract the basic information template
basic_text = re.search(r'{{Basic information[\s\S]+?}}\n\n', text).group().replace('\n*', '*')

# Split into lines and drop the parts we do not need
basic_ary = basic_text.split('\n')
del basic_ary[0]
del basic_ary[-3:]

# Convert to a dictionary
basic_dict = {}
for basic in basic_ary:
    key, *values = basic.split('=')
    key = key.replace(' ', '').replace('|', '')
    basic_dict[key] = ''.join(values).strip()
pprint.pprint(basic_dict)
```
Output result (part)
{'GDP/Man': '36,727<ref name"imf-statistics-gdp" />',
'GDP value': '2,316.2 billion<ref name"imf-statistics-gdp" />',
'GDP value MER': '2,433.7 billion<ref name"imf-statistics-gdp" />',
︙
'Prime Minister's name': '[[Boris Johnson]]',
'Prime Minister's title': '[[British Prime Minister|Prime Minister]]',
'capital': '[[London]](infact)'}
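One caveat of joining the value parts with `''.join` is that any '=' inside a value (for example in `<ref name="...">`) is silently dropped, which is why the refs above print as `name"imf-statistics-gdp"`. Splitting only at the first '=' keeps the value intact; a sketch on a made-up field line:

```python
# A made-up template field whose value itself contains '='
row = '|GDP value = 2,316.2 billion<ref name="imf-statistics-gdp" />'

# Split only at the first '=' so later '=' stay in the value
key, value = row.split('=', 1)
key = key.replace(' ', '').replace('|', '')
value = value.strip()
print(key)
print(value)  # the '=' inside the ref tag survives
```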
In addition to the processing of problem 25, remove MediaWiki's emphasis markup (all of weak emphasis, emphasis, and strong emphasis) from the template values and convert them to plain text (reference: markup quick reference table).
code
```python
import gzip
import json
import regex as re
import pprint

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]

# Extract the text
text = eng_data['text']

# Extract the basic information template
basic_text = re.search(r'{{Basic information[\s\S]+?}}\n\n', text).group().replace('\n*', '*')

# Split into lines and drop the parts we do not need
basic_ary = basic_text.split('\n')
del basic_ary[0]
del basic_ary[-3:]

# Convert to a dictionary
basic_dict = {}
for basic in basic_ary:
    # Split into key and values
    key, *values = basic.split('=')
    # Clean up the key
    key = key.replace(' ', '').replace('|', '')
    # The values come back as a list, so join them
    value = ''.join(values).strip()
    # Remove the emphasis markup
    value = value.replace("'''''", '').replace("'''", '').replace("''", '')
    basic_dict[key] = value
pprint.pprint(basic_dict)
```
Output result (part)
{'GDP/Man': '36,727<ref name"imf-statistics-gdp" />',
'GDP value': '2,316.2 billion<ref name"imf-statistics-gdp" />',
'GDP value MER': '2,433.7 billion<ref name"imf-statistics-gdp" />',
︙
'Established form 4': 'Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"',
︙
'Prime Minister's name': '[[Boris Johnson]]',
'Prime Minister's title': '[[British Prime Minister|Prime Minister]]',
'capital': '[[London]](infact)'}
In "Help: Quick reference table - Wikipedia" these are called "distinction from others (italics)", "emphasis (bold)", and "italics and emphasis"; here I treat all three as "emphasis markup".
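The three chained replace calls can also be collapsed into a single substitution, since all three emphasis markers are just runs of two to five apostrophes; a sketch on made-up values:

```python
import re

values = ["''italic''", "'''bold'''", "'''''both'''''", "no markup"]

# Runs of 2-5 apostrophes are emphasis markup; single apostrophes stay
cleaned = [re.sub(r"'{2,5}", '', v) for v in values]
print(cleaned)  # ['italic', 'bold', 'both', 'no markup']
```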
In addition to the processing of problem 26, remove MediaWiki's internal link markup from the template values and convert them to plain text (reference: markup quick reference table).
code
```python
import gzip
import json
import regex as re
import pprint

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]

# Extract the text
text = eng_data['text']

# Extract the basic information template
basic_text = re.search(r'{{Basic information[\s\S]+?}}\n\n', text).group().replace('\n*', '*')

# Split into lines and drop the parts we do not need
basic_ary = basic_text.split('\n')
del basic_ary[0]
del basic_ary[-3:]

# Convert to a dictionary
basic_dict = {}
for basic in basic_ary:
    # Split into key and values
    key, *values = basic.split('=')
    # Clean up the key
    key = key.replace(' ', '').replace('|', '')
    # The values come back as a list, so join them
    value = ''.join(values).strip()
    # Remove the emphasis markup
    value = value.replace("'''''", '').replace("'''", '').replace("'", '')
    # Get the template strings ({{...}})
    targets = re.findall(r'((?<=({{)).+?(?=(}})))', value)
    # Replace each template with its display text
    if targets:
        for target in targets:
            value = re.sub(r'{{.+?}}', target[0].split('|')[-1], value, count=1)
    # Get the internal link strings ([[...]])
    targets = re.findall(r'((?<=(\[\[)).+?(?=(\]\])))', value)
    # Replace each internal link with its display text
    if targets:
        for target in targets:
            value = re.sub(r'\[\[.+?\]\]', target[0].split('|')[-1], value, count=1)
    basic_dict[key] = value
pprint.pprint(basic_dict)
```
Output result (part)
{'GDP/Man': '36,727<ref name"imf-statistics-gdp" />',
'GDP value': '2,316.2 billion<ref name"imf-statistics-gdp" />',
'GDP value MER': '2,433.7 billion<ref name"imf-statistics-gdp" />',
︙
'Established form 3': 'United Kingdom of Great Britain and Ireland established<br />(1800 Joint Law)',
︙
'Prime Minister's name': 'Boris Johnson',
'Prime Minister's title': 'Prime Minister',
'capital': 'London (virtually)'}
The "Help: Quick reference table - Wikipedia" page documents patterns like "[[Article title|Display text]]", but in reality there also seems to be a "{{Article title|Display text}}" pattern.
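Both forms can also be rewritten in one pass each with a backreference substitution that keeps only the part after the last '|'; a sketch on made-up values:

```python
import re

def strip_links(value):
    # Keep only the display text (the part after the last '|', if any)
    value = re.sub(r'\[\[(?:[^\[\]|]*\|)*([^\[\]|]*)\]\]', r'\1', value)
    value = re.sub(r'{{(?:[^{}|]*\|)*([^{}|]*)}}', r'\1', value)
    return value

print(strip_links('[[British Prime Minister|Prime Minister]]'))  # Prime Minister
print(strip_links('[[London]](in fact)'))                        # London(in fact)
print(strip_links('{{lang|fr|Dieu et mon droit}}'))              # Dieu et mon droit
```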
In addition to the processing of problem 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.
code
```python
import gzip
import json
import regex as re
import pprint

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]

# Extract the text
text = eng_data['text']

# Extract the basic information template
basic_text = re.search(r'{{Basic information[\s\S]+?}}\n\n', text).group().replace('\n*', '*')

# Split into lines and drop the parts we do not need
basic_ary = basic_text.split('\n')
del basic_ary[0]
del basic_ary[-3:]

# Convert to a dictionary
basic_dict = {}
for basic in basic_ary:
    # Split into key and values
    key, *values = basic.split('=')
    # Clean up the key
    key = key.replace(' ', '').replace('|', '')
    # The values come back as a list, so join them
    value = ''.join(values).strip()
    # Remove the emphasis markup
    value = value.replace("'''''", '').replace("'''", '').replace("'", '')
    # Get the template strings ({{...}})
    targets = re.findall(r'((?<=({{)).+?(?=(}})))', value)
    # Replace each template with its display text
    if targets:
        for target in targets:
            value = re.sub(r'{{.+?}}', target[0].split('|')[-1], value, count=1)
    # Get the internal link strings ([[...]])
    targets = re.findall(r'((?<=(\[\[)).+?(?=(\]\])))', value)
    # Replace each internal link with its display text
    if targets:
        for target in targets:
            value = re.sub(r'\[\[.+?\]\]', target[0].split('|')[-1], value, count=1)
    # Remove tags
    value = value.replace('<br />', '')
    value = re.sub(r'<ref.+?</ref>', '', value)
    value = re.sub(r'<ref.+?/>', '', value)
    basic_dict[key] = value
pprint.pprint(basic_dict)
```
Output result (part)
{'GDP/Man': '36,727',
'GDP value': '2,316.2 billion',
'GDP value MER': '2,433.7 billion',
︙
'Prime Minister's name': 'Boris Johnson',
'Prime Minister's title': 'Prime Minister',
'capital': 'London (virtually)'}
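The two ref substitutions can also be combined into one pattern that matches `<ref` up to either a self-closing `/>` or a closing `</ref>`; a sketch on made-up values:

```python
import re

values = [
    '36,727<ref name="imf-statistics-gdp" />',
    '244,820<ref>Source: United Nations</ref>km<sup>2</sup>',
]

# One pattern covers self-closing refs and ref...[/ref] pairs;
# other tags such as <sup> are left untouched
cleaned = [re.sub(r'<ref[^>]*?/>|<ref.*?</ref>', '', v) for v in values]
print(cleaned)
```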
Use the contents of the template to get the URL of the national flag image. (Hint: Call imageinfo in the MediaWiki API to convert file references to URLs)
code
```python
import gzip
import json
import regex as re
import requests

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    # The original data is JSON Lines, not JSON, so read it line by line
    lines = f.readlines()
    for line in lines:
        jsons.append(json.loads(line))

# Extract the UK article
eng = list(filter(lambda e: e['title'] == 'England', jsons))
eng_data = eng[0]

# Extract the text
text = eng_data['text']

# Extract the basic information template
basic_text = re.search(r'{{Basic information[\s\S]+?}}\n\n', text).group().replace('\n*', '*')

# Split into lines and drop the parts we do not need
basic_ary = basic_text.split('\n')
del basic_ary[0]
del basic_ary[-3:]

# Convert to a dictionary
basic_dict = {}
for basic in basic_ary:
    # Split into key and values
    key, *values = basic.split('=')
    # Clean up the key
    key = key.replace(' ', '').replace('|', '')
    # The values come back as a list, so join them
    value = ''.join(values).strip()
    # Remove the emphasis markup
    value = value.replace("'''''", '').replace("'''", '').replace("'", '')
    # Get the template strings ({{...}})
    targets = re.findall(r'((?<=({{)).+?(?=(}})))', value)
    # Replace each template with its display text
    if targets:
        for target in targets:
            value = re.sub(r'{{.+?}}', target[0].split('|')[-1], value, count=1)
    # Get the internal link strings ([[...]])
    targets = re.findall(r'((?<=(\[\[)).+?(?=(\]\])))', value)
    # Replace each internal link with its display text
    if targets:
        for target in targets:
            value = re.sub(r'\[\[.+?\]\]', target[0].split('|')[-1], value, count=1)
    # Remove tags
    value = value.replace('<br />', '')
    value = re.sub(r'<ref.+?</ref>', '', value)
    value = re.sub(r'<ref.+?/>', '', value)
    basic_dict[key] = value

# API call
session = requests.Session()
params = {
    'action': 'query',
    'format': 'json',
    'prop': 'imageinfo',
    'titles': 'File:' + basic_dict['National flag image'],
    'iiprop': 'url'
}
result = session.get('https://ja.wikipedia.org/w/api.php', params=params)
res_json = result.json()
print(res_json['query']['pages']['-1']['imageinfo'][0]['url'])
```
Output result
https://upload.wikimedia.org/wikipedia/commons/a/ae/Flag_of_the_United_Kingdom.svg
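The pages key ('-1' above) depends on whether the file page exists locally, so hardcoding it is fragile. Taking the first page entry works regardless of the key; a sketch against a made-up response dictionary shaped like the imageinfo output:

```python
# A made-up response shaped like the MediaWiki imageinfo JSON
res_json = {
    'query': {
        'pages': {
            '-1': {
                'title': 'File:Flag of the United Kingdom.svg',
                'imageinfo': [{'url': 'https://upload.wikimedia.org/wikipedia/commons/a/ae/Flag_of_the_United_Kingdom.svg'}],
            }
        }
    }
}

# Take the first (and only) page entry, whatever its key is
page = next(iter(res_json['query']['pages'].values()))
print(page['imageinfo'][0]['url'])
```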