Please refer to First Post
Added on 9/24.
There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format.
・Each line stores the information for one article in JSON format.
・In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format.
・The entire file is compressed with gzip.
Create a program that performs the following processing.
Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.
basic_info_025.py
from training.json_read_020 import uk_find
import re


def basic_info_find(lines):
    # A group (?:...) is needed here; square brackets would define a character class.
    pattern1 = re.compile(r'^\{\{(?:redirect|Basic information).*')
    pattern2 = re.compile(r'^\|.*')
    pattern3 = re.compile(r'^\}\}$')
    basic_dict = {}
    for line in lines.split('\n'):
        if pattern1.match(line):
            continue
        elif pattern2.match(line):
            point = line.find('=')
            if point == -1:
                continue  # no '=' on this line, so there is no field to store
            # Text before the first '=' is the field name, text after it is the value.
            title = line[:point].lstrip('|').rstrip(' ')
            data = line[point:].lstrip('= ')
            basic_dict[title] = data
        elif pattern3.match(line):
            break
    return basic_dict


if __name__ == "__main__":
    lines = uk_find()
    basic_dict = basic_info_find(lines)
    for key, value in basic_dict.items():
        print(key + ':' + value)
result
Established form 4:Changed to the current country name "'''United Kingdom of Great Britain and Northern Ireland'''"
National emblem image:[[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]
National emblem link:([[British coat of arms|National emblem]])
(Omitted because it is long)
Process finished with exit code 0
Impressions: From the basic information template, I extracted the lines starting with | and looped over them, storing the text before the = as the dictionary key and the text after it as the value. The output is printed in key:value form so that it is easy to read.
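For reference, uk_find() is imported from an earlier problem and is not shown in this post. Below is a minimal sketch of how the gzip-compressed file with one JSON object per line could be read. The function name find_article_text and its parameters are hypothetical and are not the author's implementation.

```python
import gzip
import json


def find_article_text(path, title):
    """Return the "text" of the first article whose "title" matches, else None.

    Hypothetical helper: the real uk_find() from the earlier post may differ.
    """
    # gzip.open in text mode decompresses on the fly; each line is one JSON object.
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            article = json.loads(line)
            if article.get("title") == title:
                return article.get("text")
    return None
```

Reading line by line keeps memory use low, because only one article is decoded at a time.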
In addition to the processing in problem 25, remove MediaWiki's emphasis markup (weak emphasis, emphasis, and strong emphasis) from the template values and convert them to text (reference: markup quick reference table).
emphasize_remove_026.py
from training.json_read_020 import uk_find
from training.basic_info_025 import basic_info_find
import re


def emphasize_remove(basic_dict):
    # Weak emphasis is '' (2 quotes), emphasis is ''' (3), strong emphasis is ''''' (5).
    pattern = re.compile(r"'{2,5}")
    for key, value in basic_dict.items():
        # Remove only runs of 2 to 5 quotes so that a single apostrophe in the text survives.
        basic_dict[key] = pattern.sub('', value)
    return basic_dict


if __name__ == "__main__":
    lines = uk_find()
    basic_dict = basic_info_find(lines)
    emphasize_remove_dict = emphasize_remove(basic_dict)
    for key, value in emphasize_remove_dict.items():
        print(key + ':' + value)
result
GDP statistics year yuan:2012
Established form 4:Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"
Area size:1 E11
(Omitted because it is long)
Process finished with exit code 0
Impressions: Only one value actually contained emphasis markup, but the pattern looks for runs of two or more apostrophes so that every kind of emphasis markup is found. Once a match is found, the markup is simply removed by replacement.
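As a small standalone illustration, the three emphasis forms could be stripped with a single regular expression while leaving a lone apostrophe intact. The function name strip_emphasis is hypothetical, and the sample string is made up for the example.

```python
import re

# Runs of 2 to 5 apostrophes cover '' (weak), ''' (emphasis) and ''''' (strong).
EMPHASIS = re.compile(r"'{2,5}")


def strip_emphasis(value):
    """Remove MediaWiki emphasis quotes but keep single apostrophes."""
    return EMPHASIS.sub("", value)


if __name__ == "__main__":
    print(strip_emphasis("''weak'' '''normal''' '''''strong''''' God's"))
```

Matching only runs of quotes avoids deleting apostrophes that are part of ordinary text.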
In addition to the processing in problem 26, remove MediaWiki's internal link markup from the template values and convert them to text (reference: markup quick reference table).
link_remove_027.py
from training.json_read_020 import uk_find
from training.basic_info_025 import basic_info_find
from training.emphasize_remove_026 import emphasize_remove
import re


def link_remove(emphasize_remove_dict):
    pattern = re.compile(r".*\[{2}.*")
    for key, value in emphasize_remove_dict.items():
        if pattern.match(value):
            value = value.replace('[[', '').replace(']]', '')
            emphasize_remove_dict.update({key: value})
    return emphasize_remove_dict


if __name__ == "__main__":
    lines = uk_find()
    basic_dict = basic_info_find(lines)
    emphasize_remove_dict = emphasize_remove(basic_dict)
    link_remove_dict = link_remove(emphasize_remove_dict)
    for key, value in link_remove_dict.items():
        print(key + ':' + value)
result
National emblem image:File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms
Official country name:{{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br/>
Founding form:Founding of the country
(Omitted because it is long)
Process finished with exit code 0
Impressions: As in problem 26, when I found a value containing an internal link starting with [[, I simply removed [[ and ]] with replace.
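Internal links can also take the forms [[article|display text]] and [[article#section|display text]], where only the display text is shown to the reader. The following is a rough sketch, not the code used above, of keeping just the display text; the name strip_internal_links is hypothetical, and File: links with several pipes are deliberately not handled here.

```python
import re

# Optional "target|" prefix, then the text that is actually displayed.
INTERNAL_LINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")


def strip_internal_links(value):
    """Replace [[target]] or [[target|text]] with the visible text only."""
    return INTERNAL_LINK.sub(r"\1", value)


if __name__ == "__main__":
    print(strip_internal_links("([[British coat of arms|National emblem]])"))
```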
In addition to the processing in problem 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.
markup_remove_028.py
from training.json_read_020 import uk_find
from training.basic_info_025 import basic_info_find
from training.emphasize_remove_026 import emphasize_remove
from training.link_remove_027 import link_remove
import re


# Removes the pound sign "(£)".
def pound_check(value):
    pattern = re.compile(r".*pound.*")
    if pattern.match(value):
        value = value.replace("(£)", '')
    return value


# Removes <br /> and <br/> tags.
def br_check(value):
    pattern1 = re.compile(r".*<br.*")
    if pattern1.match(value):
        value = value.replace("<br />", '').replace("<br/>", '')
    return value


# Removes a <ref> tag and the reference text that follows it.
def ref_check(value):
    pattern2 = re.compile(r".*<ref.*")
    if pattern2.match(value):
        start_point = value.find("<ref")
        value = value[0:start_point]
    return value


# Removes {{ and }}.
def brackets_check(value):
    pattern3 = re.compile(r".*\{\{.*")
    if pattern3.match(value):
        value = value.replace("{{", "").replace("}}", "")
        # For "lang|en|United ~", skip the first pipe and the 3 characters "en|" after it.
        start_point = value.find("|") + 4
        value = value[start_point:len(value)]
    return value


# Removes "File:" and everything from the first pipe onward.
def file_check(value):
    pattern4 = re.compile(r".*File.*")
    if pattern4.match(value):
        value = value.replace('File:', '')
        start_point = value.find("|")
        if start_point != -1:  # without a pipe, slicing at -1 would drop the last character
            value = value[0:start_point]
    return value


# Removes everything from the first half-width pipe |.
# If the value also contains (, a closing ) is appended.
def pipe_check(value):
    pattern5 = re.compile(r".*\|.*")
    pattern6 = re.compile(r".*\(.*")
    if pattern5.match(value) and pattern6.match(value):
        end_point = value.find("|")
        value = value[0:end_point] + ")"
    elif pattern5.match(value):
        end_point = value.find("|")
        value = value[0:end_point]
    return value


# Removes the full-width ( when the value starts with it.
def other_check(value):
    pattern7 = re.compile(r"^\(")
    if pattern7.match(value):
        value = value.replace("(", "")
    return value


def markup_remove(link_remove_dict):
    # Apply every check to each value in turn.
    for key, value in link_remove_dict.items():
        value = pound_check(value)
        value = br_check(value)
        value = ref_check(value)
        value = brackets_check(value)
        value = file_check(value)
        value = pipe_check(value)
        value = other_check(value)
        link_remove_dict.update({key: value})
    return link_remove_dict


if __name__ == "__main__":
    lines = uk_find()
    basic_dict = basic_info_find(lines)
    emphasize_remove_dict = emphasize_remove(basic_dict)
    link_remove_dict = link_remove(emphasize_remove_dict)
    markup_remove_dict = markup_remove(link_remove_dict)
    for key, value in markup_remove_dict.items():
        print(key + ':' + value)
    print(len(markup_remove_dict))
result
Date of establishment 1:927/843
Official country name:United Kingdom of Great Britain and Northern Ireland
Established form 1:Kingdom of England / Kingdom of Scotland (Both countries are Acts of Union)(1707))
Position image:Location_UK_EU_Europe_001.svg
Motto:Dieu et mon droit (French:God and my rights)
ccTLD:.uk / .gb
National flag image:Flag of the United Kingdom.svg
currency:Sterling pound
(Omitted because it is long)
Process finished with exit code 0
Impressions: First I looked for markup, wrote a compiled pattern for it, and repeated this to see what kinds of markup were caught. In the end I decided to apply every pattern to each value in turn. However, nothing in full-width notation was matched, and I really wondered why it wasn't caught. I'm tired.
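Several of these checks could also be written as regular-expression substitutions. The sketch below covers only <ref> tags, <br> tags, and the {{lang|xx|text}} template, under the assumption that each value is a single line as in the results above. The name strip_markup is hypothetical, and this is an alternative illustration rather than the code used in this post.

```python
import re


def strip_markup(value):
    """Remove <ref>, <br> and {{lang|xx|text}} markup from one template value."""
    # Self-closing references such as <ref name="x" /> go first,
    # so they are not mistaken for the start of a paired <ref>...</ref>.
    value = re.sub(r"<ref[^>]*/>", "", value)
    # Paired references, including the reference text between the tags.
    value = re.sub(r"<ref[^>]*>.*?</ref>", "", value, flags=re.DOTALL)
    # A <ref> whose closing tag is on another line: drop the rest of the value.
    value = re.sub(r"<ref[^>]*>.*", "", value, flags=re.DOTALL)
    # <br>, <br/> and <br />.
    value = re.sub(r"<br\s*/?>", "", value)
    # {{lang|en|United ...}} keeps only the text after the language code.
    value = re.sub(r"\{\{lang\|[^|}]+\|([^}]*)\}\}", r"\1", value)
    return value.strip()


if __name__ == "__main__":
    sample = ("{{lang|en|United Kingdom of Great Britain and Northern Ireland}}"
              "<ref>Official country name other than English:<br/>")
    print(strip_markup(sample))
```

Capturing the text inside {{lang|...}} avoids relying on a fixed character offset after the first pipe.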
Use the contents of the template to get the URL of the national flag image. (Hint: Call imageinfo in the MediaWiki API to convert file references to URLs)
get_url_029.py
# -*- coding: utf-8 -*-
from training.json_read_020 import uk_find
from training.basic_info_025 import basic_info_find
import requests
import urllib.parse
import json
import re


def image_query(filename):
    # Build the MediaWiki API query URL for the given file name.
    url = "https://commons.wikimedia.org/w/api.php?"
    action = "action=query&"
    titles = "titles=File:" + urllib.parse.quote(filename) + "&"
    prop = "prop=imageinfo&"
    iiprop = "iiprop=url&"
    fmt = "format=json"  # renamed from "format" to avoid shadowing the built-in
    parameter = url + action + titles + prop + iiprop + fmt
    return parameter


def get_request(parameter):
    pattern = re.compile(r".*\"url\".*")
    r = requests.get(parameter)
    data = r.json()
    url_data = None  # returned as-is if no "url" line is found
    # Loop over the returned pages instead of hard-coding the page ID "347935".
    for page in data["query"]["pages"].values():
        json_data = json.dumps(page.get("imageinfo", []), indent=4)
        for temp in json_data.split('\n'):
            if pattern.search(temp):
                url_data = temp.replace(" ", "")
    return url_data


if __name__ == "__main__":
    lines = uk_find()
    basic_dict = basic_info_find(lines)
    parameter = image_query(basic_dict['National flag image'])
    get_url = get_request(parameter)
    print(get_url)
result
"url":"https://upload.wikimedia.org/wikipedia/commons/a/ae/Flag_of_the_United_Kingdom.svg"
Process finished with exit code 0
Impressions: At first I wasn't sure what to do. After a lot of googling, I understood that the point is to send a request to Wikimedia asking for data about the file name, and to find in the response the URL where the image file is uploaded. It took me a long time to understand this problem, but I learned a lot from it.
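The request and response handling described here can also be sketched with only the standard library, reading the URL straight from the parsed JSON instead of searching the dumped text. The names build_query_url, extract_urls and fetch_image_urls, as well as the User-Agent string, are hypothetical placeholders. The response shape is assumed from the query above (query → pages → imageinfo → url).

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"  # same endpoint as in the post


def build_query_url(filename):
    """Build an imageinfo query URL; urlencode handles the escaping."""
    params = {
        "action": "query",
        "titles": "File:" + filename,
        "prop": "imageinfo",
        "iiprop": "url",
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_urls(response):
    """Collect every imageinfo "url" without assuming a particular page ID."""
    urls = []
    for page in response.get("query", {}).get("pages", {}).values():
        for info in page.get("imageinfo", []):
            if "url" in info:
                urls.append(info["url"])
    return urls


def fetch_image_urls(filename):
    """Send the request (needs network access) and return the image URLs."""
    # The User-Agent value is a made-up placeholder for this example.
    request = urllib.request.Request(
        build_query_url(filename),
        headers={"User-Agent": "nlp100-example/0.1"},
    )
    with urllib.request.urlopen(request) as resp:
        return extract_urls(json.load(resp))
```

Calling fetch_image_urls("Flag of the United Kingdom.svg") would perform the same lookup as the script above. Keeping the parsing in extract_urls lets it be checked against a saved response without network access.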