Please refer to the first post.
(Added 9/24)
There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format:
- One article per line, stored in JSON format.
- In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out as JSON.
- The entire file is compressed with gzip.
Create a program that performs the following processing.
Read the JSON file of the Wikipedia articles and display the article body for "UK". For problems 21-29, run against the article text extracted here.
json_read_020.py
# -*- coding: utf-8 -*-
import json
import gzip
import re


def uk_find():
    basepath = '/Users/masassy/PycharmProjects/Pywork/training/'
    filename = 'jawiki-country.json.gz'
    pattern = r"England"
    with gzip.open(basepath + filename, 'rt') as gf:
        for line in gf:
            # json.loads is str -> dict, json.load is file -> dict
            json_data = json.loads(line)
            # re.match takes the pattern first, then the string to match
            if re.match(pattern, json_data['title']):
                return json_data['text']


if __name__ == "__main__":
    json_data = uk_find()
    print(json_data)
result
{{redirect|UK}}
{{Basic information Country
|Abbreviated name=England
(Omitted because it is long)
[[Category:Island country|Kureito Furiten]]
[[Category:States / Regions Established in 1801]]
Process finished with exit code 0
Impressions: It took me some time to understand the data format returned by gzip.open and the data format of json_data.
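The two data formats can be checked with a minimal, self-contained sketch. This uses an in-memory gzip stream instead of the real jawiki-country.json.gz (the sample JSON line is made up for illustration), but the types are the same:

```python
import gzip
import io
import json

# Build a tiny gzip-compressed "file" in memory with one JSON object per line,
# mimicking the jawiki-country.json.gz layout.
raw = b'{"title": "England", "text": "{{redirect|UK}} ..."}\n'
buf = io.BytesIO(gzip.compress(raw))

# Opening in 'rt' mode decompresses and decodes, so iteration yields str lines.
with gzip.open(buf, 'rt') as gf:
    for line in gf:
        print(type(line).__name__)     # str
        article = json.loads(line)     # str -> dict
        print(type(article).__name__)  # dict
        print(article['title'])        # England
```

So gzip.open in text mode hands back ordinary strings, and json.loads turns each one into a dictionary.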
Extract the line that declares the category name in the article.
category_021.py
from training.json_read_020 import uk_find
import re

if __name__ == "__main__":
    pattern = re.compile(r'.*Category.*')
    lines = uk_find()
    for line in lines.split('\n'):
        if pattern.match(line):
            print(line)
result
[[Category:England|*]]
[[Category:Commonwealth Kingdom|*]]
[[Category:G8 member countries]]
[[Category:European Union member states]]
[[Category:Maritime nation]]
[[Category:Sovereign country]]
[[Category:Island country|Kureito Furiten]]
[[Category:States / Regions Established in 1801]]
Process finished with exit code 0
Impressions: It took me a while to realize that combining the regular expression .*Category.* with lines.split('\n') lets me return the lines that contain the search string.
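Because pattern.match anchors at the start of the line, the leading .* is what makes it behave like a substring search; re.search, or a plain `in` test, expresses the same idea more directly. A small sketch on made-up sample lines:

```python
import re

lines = "[[Category:England|*]]\nsome other line\n[[Category:Island country]]"

# All three approaches select exactly the lines containing 'Category'.
by_match = [l for l in lines.split('\n') if re.match(r'.*Category.*', l)]
by_search = [l for l in lines.split('\n') if re.search(r'Category', l)]
by_in = [l for l in lines.split('\n') if 'Category' in l]

assert by_match == by_search == by_in
print(by_match)  # ['[[Category:England|*]]', '[[Category:Island country]]']
```

The `in` version is the simplest when no real pattern matching is needed.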
Extract the article category names (by name, not line by line).
category_str_022.py
from training.json_read_020 import uk_find
import re

if __name__ == "__main__":
    pattern = re.compile(r'.*Category:.*')
    pattern2 = re.compile(r'.*\|.*')
    lines = uk_find()
    for line in lines.split('\n'):
        if pattern.match(line):
            # lstrip()/rstrip() strip a *set* of characters, not an exact
            # prefix, so slice off the literal '[[Category:' prefix instead
            strip_line = line[len('[[Category:'):].rstrip(']')
            if pattern2.match(strip_line):
                N = strip_line.find('|')
                strip_line2 = strip_line[:N]
                print(strip_line2)
            else:
                print(strip_line)
result
England
Commonwealth Kingdom
G8 member countries
European Union member states
Maritime nation
Sovereign country
Island country
States / Regions Established in 1801
Process finished with exit code 0
Impression: After extracting the Category lines, the point of ingenuity is that if a line contains a '|', the end of the slice is taken from the position of that '|'.
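The find-then-slice step can also be written with str.split, which returns the text before the first '|' whether or not one is present, so the branch on pattern2 disappears. A sketch using two category strings from the results above:

```python
# split('|', 1)[0] yields the whole string when there is no '|',
# so no separate if/else branch is needed.
for strip_line in ("England|*", "Sovereign country"):
    print(strip_line.split('|', 1)[0])
# England
# Sovereign country
```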
Display the section name and its level contained in the article (for example, 1 if "== section name ==").
section_023.py
import re
from training.json_read_020 import uk_find

if __name__ == "__main__":
    pattern = re.compile(r'^=.*')
    pattern2 = re.compile(r'^={2}')
    pattern3 = re.compile(r'^={3}')
    pattern4 = re.compile(r'^={4}')
    lines = uk_find()
    for line in lines.split('\n'):
        if pattern.match(line):
            # test the longest prefix first so '====' is not caught by '=='
            if pattern4.match(line):
                print(line.strip('=') + ':Level 4')
            elif pattern3.match(line):
                print(line.strip('=') + ':Level 3')
            elif pattern2.match(line):
                print(line.strip('=') + ':Level 2')
            else:
                print('no match')
result
Country name:Level 2
history:Level 2
Geography:Level 2
climate:Level 3
(Omitted because it is long)
Process finished with exit code 0
Impressions: After compiling four patterns, I first extracted the lines starting with =, then branched the processing by level, but it felt quite brute-force. There is probably a better way.
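One alternative to the four compiled patterns is a single regex that captures the run of '=' characters and the section name; the level then falls out of the length of the capture. A sketch on made-up sample lines, still reporting the raw count of '=' as the level, matching the output above:

```python
import re

# One pattern: a run of 2-4 '='s, the section name, then trailing '='s.
section = re.compile(r'^(={2,4})\s*(.+?)\s*=+$')

sample = "== Country name ==\n=== climate ===\nplain text\n==== detail ===="
for line in sample.split('\n'):
    m = section.match(line)
    if m:
        print('{}:Level {}'.format(m.group(2), len(m.group(1))))
# Country name:Level 2
# climate:Level 3
# detail:Level 4
```

The non-greedy (.+?) together with \s* also trims the spaces around the section name, which the strip('=') approach leaves in place.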
Extract all the media files referenced in the article.
media_024.py
from training.json_read_020 import uk_find
import re

if __name__ == "__main__":
    # the article marks media both as 'File:' and as 'ファイル:'
    pattern = re.compile(r".*(File|ファイル).*")
    # the '|' must be escaped; unescaped, r"^|.*" means '^' OR '.*'
    pattern2 = re.compile(r"^\|.*")
    lines = uk_find()
    for line in lines.split('\n'):
        if pattern2.search(line):
            line = line.lstrip('|')
        if pattern.search(line):
            start = line.find(':') + 1
            end = line.find('|')
            if end == -1:
                end = len(line)  # no trailing '|': take the rest of the line
            print(line[start:end])
result
Royal Coat of Arms of the United Kingdom.svg
Battle of Waterloo 1815.PNG
The British Empire.png
Uk topo en.jpg
BenNevis2005.jpg
(Omitted because it is long)
Impressions: The point of ingenuity was consulting the markup quick reference to determine the slice positions, and processing each line to match the format of the file reference.
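The media references can also be pulled out in one pass with re.findall, capturing the filename between 'File:' (or 'ファイル:') and the next '|' or ']'. A sketch on markup fragments shaped like the article's (the fragment text itself is illustrative):

```python
import re

text = (
    "|国章画像 = [[File:Royal Coat of Arms of the United Kingdom.svg|85px]]\n"
    "[[ファイル:Uk topo en.jpg|thumb|200px|topography]]\n"
)

# Capture everything after 'File:'/'ファイル:' up to the first '|' or ']'.
files = re.findall(r'(?:File|ファイル):([^|\]]+)', text)
print(files)
# ['Royal Coat of Arms of the United Kingdom.svg', 'Uk topo en.jpg']
```

This removes the manual find(':')/find('|') slicing and works for both infobox lines and inline [[File:...]] links.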