A record of solving the problems in the first half of Chapter 3. As you can see on the web page, the target file is jawiki-country.json, which is the decompressed form of jawiki-country.json.gz.
There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format. Each line stores the information of one article in JSON format. In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format. The entire file is gzipped. Create a program that performs the following processing.
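To make the file format concrete, here is a minimal sketch that writes two made-up articles to a temporary gzipped file and reads them back one line at a time. It is written for Python 3, and the helper name read_articles and the sample records are assumptions for illustration only, not part of the original data.

```python
import gzip
import json
import os
import tempfile


def read_articles(path):
    """Yield one dict per line from a gzipped file with one JSON object per line."""
    # 'rt' opens the gzip stream in text mode, so each line is a str.
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)


# Hypothetical sample data in the same shape as jawiki-country.json.gz.
samples = [
    {'title': 'A', 'text': 'body of A'},
    {'title': 'B', 'text': 'body of B'},
]
tmp_path = os.path.join(tempfile.mkdtemp(), 'sample.json.gz')
with gzip.open(tmp_path, 'wt', encoding='utf-8') as f:
    for article in samples:
        f.write(json.dumps(article, ensure_ascii=False) + '\n')

articles = list(read_articles(tmp_path))
titles = [a['title'] for a in articles]
print(titles)
```

Reading the compressed file directly like this would avoid the separate decompression step, although the programs in this post work on the already decompressed jawiki-country.json.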
Read the JSON file of the Wikipedia articles and display the article text about "UK". Problems 21-29 are run on the article text extracted here.
# -*- coding: utf-8 -*-
__author__ = 'todoroki'
import re
import json
inputfile = 'jawiki-country.json'
outputfile = 'jawiki-england.txt'
f = open(inputfile)
g = open(outputfile, 'w')
target = re.compile(u'イギリス')  # "イギリス" is the Japanese name for the UK
for line in f:
    # Each line is one article stored as a JSON object.
    res = json.loads(line)
    if target.search(res[u'text']):
        g.write(res['text'].encode('utf8') + '\n')
f.close()
g.close()
#=> (output is written to the file jawiki-england.txt)
I use the re module. Because Japanese text is handled as a unicode string, the pattern is written as a unicode literal of the form u'...' (here, the Japanese word for "UK"). The pattern is compiled into a regular expression with the compile method of the re module, and the search method determines whether the text of each line (one article per line) contains the target (UK).
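The same idea can be sketched in Python 3, where every str is already unicode and the u prefix is not needed. The input lines and the helper name select_articles below are made up for illustration; they are assumptions, not the real data or the author's code.

```python
import json
import re

# Hypothetical input lines, one JSON object per line as in the source file.
lines = [
    json.dumps({'title': 'イギリス', 'text': 'イギリスは島国である。'},
               ensure_ascii=False),
    json.dumps({'title': '日本', 'text': '日本は東アジアの国である。'},
               ensure_ascii=False),
]

target = re.compile('イギリス')  # plain str literal; no u prefix in Python 3


def select_articles(json_lines, pattern):
    """Return the text of every article whose body matches the pattern."""
    selected = []
    for line in json_lines:
        article = json.loads(line)
        if pattern.search(article['text']):
            selected.append(article['text'])
    return selected


texts = select_articles(lines, target)
print(len(texts))
```

Note that searching the body selects every article that mentions the word. Comparing the "title" key with the article name instead would probably be the stricter way to pick out only the UK article.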
Extract the line that declares the category name in the article.
# -*- coding: utf-8 -*-
__author__ = 'todoroki'
import re
inputfile = 'jawiki-england.txt'
outputfile = 'jawiki-england_category.txt'
f = open(inputfile)
g = open(outputfile, "w")
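category = re.compile(r'\[\[Category:.+\]\]')  # raw string keeps the backslashes as written
for line in f:
    # match() only succeeds when the pattern starts at the beginning of the line.
    if category.match(line):
        g.write(line.strip() + "\n")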
f.close()
g.close()
#=> (output is written to the file jawiki-england_category.txt)
Same approach as the previous problem. Each line is checked for whether it matches [[Category: ~]]. The characters [ and ] are escaped because they have a special meaning in a regular expression.
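The escaping can be checked on a few made-up lines of wiki markup. This is only a sketch in Python 3: the sample lines are assumptions, and re.escape is shown as an alternative way to build the same pattern without writing the backslashes by hand.

```python
import re

# Escaped brackets match the literal "[[" and "]]".
category = re.compile(r'\[\[Category:.+\]\]')

# The same pattern, letting re.escape add the backslashes.
built = re.compile(re.escape('[[Category:') + '.+' + re.escape(']]'))

# Hypothetical lines of wiki markup.
lines = [
    '[[Category:イギリス|*]]',
    '[[Category:島国]]',
    'Plain text that mentions Category: but is not a declaration.',
]

category_lines = [line for line in lines if category.match(line)]
built_lines = [line for line in lines if built.match(line)]
print(category_lines)
```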
Extract the article category names (by name, not line by line).
# -*- coding: utf-8 -*-
__author__ = 'todoroki'
import re
inputfile = 'jawiki-england.txt'
outputfile = 'jawiki-england_category-name.txt'
f = open(inputfile)
g = open(outputfile, "w")
category = re.compile(r'\[\[Category:(.+)\]\]')  # (.+) captures the category name
for line in f:
    l = category.match(line)
    if l:
        g.write(l.group(1) + "\n")
f.close()
g.close()
#=> (output is written to the file jawiki-england_category-name.txt)
The category name is obtained with the group method of the match object returned by the re module. group returns the part matched by the sub-pattern (.+), which was enclosed in parentheses when the regular expression was compiled. If the argument is 0, the whole match is returned; if it is a positive number n, the part matched by the n-th parenthesized group is returned (if the number is larger than the number of groups, IndexError is raised).
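The behavior of group described above can be tried on a single line. This is a small Python 3 sketch; the sample strings are made up for illustration.

```python
import re

category = re.compile(r'\[\[Category:(.+)\]\]')
m = category.match('[[Category:島国]]')

whole = m.group(0)  # the entire matched string
name = m.group(1)   # the text captured by the first (...) group

try:
    m.group(2)      # the pattern has only one group
    error = None
except IndexError as e:
    error = e

# With a sort key after "|", (.+) captures the key as well.
with_sort_key = category.match('[[Category:イギリス|*]]').group(1)

print(whole, name, type(error).__name__, with_sort_key)
```

Because (.+) also captures anything after a |, a category written with a sort key comes out as 'イギリス|*'. If only the bare name is wanted, the pattern would need to stop at the |.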
Display the section name and its level contained in the article (for example, 1 if "== section name ==").
# -*- coding: utf-8 -*-
__author__ = 'todoroki'
import re
inputfile = 'jawiki-england.txt'
outputfile = 'jawiki-england_section.txt'
f = open(inputfile)
g = open(outputfile, "w")
section = re.compile(r'=(=+) (.+) =')
for line in f:
    l = section.match(line)
    if l:
        # group(1) holds every leading "=" except the first, so its length is the level.
        g.write("sec%s : " % len(l.group(1)))
        g.write(l.group(2) + "\n")
f.close()
g.close()
#=> (output is written to the file jawiki-england_section.txt)
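As in the previous problem, I use the group method. The section level is determined by the number of = characters: the first = is matched literally and the remaining ones are captured, so the length of the first group is the level.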
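The level calculation can be sketched as a small function. Note that this is a variation on the pattern above, not the author's exact one: it assumes headings may also be written without spaces around the name (as in "==name=="), so \s* is used instead of a literal space. The function name parse_section and the sample headings are hypothetical.

```python
import re

# "=" once, then the remaining "=" captured; spaces around the name are optional.
section = re.compile(r'^=(=+)\s*(.+?)\s*=+\s*$')


def parse_section(line):
    """Return (level, name) for a heading line, or None for any other line."""
    m = section.match(line)
    if m is None:
        return None
    return len(m.group(1)), m.group(2)


print(parse_section('== History =='))
print(parse_section('===地理==='))
print(parse_section('not a heading'))
```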
Extract all the media files referenced from the article.
# -*- coding: utf-8 -*-
__author__ = 'todoroki'
import re
inputfile = 'jawiki-england.txt'
outputfile = 'jawiki-england_media.txt'
f = open(inputfile)
g = open(outputfile, "w")
mediafile = re.compile(r".*(File|ファイル|file):(.*\.[a-zA-Z0-9]+)\|.*")  # "ファイル" is the Japanese "File" prefix
for line in f:
    l = mediafile.match(line)
    if l:
        # group(2) is the file name, up to the extension before "|".
        g.write(l.group(2) + "\n")
f.close()
g.close()
#=> (output is written to the file jawiki-england_media.txt)
Same approach as the previous problems.
I found group complicated to use at first, but after solving a few problems I more or less understood it.
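One limitation of using match for the media files is that it returns at most one file name per line. As a hedged alternative, the sketch below uses findall to collect every reference on a line. It is written for Python 3 and assumes that a file name ends at the first | or ]; the sample line is made up.

```python
import re

# Assumption: a media reference looks like [[File:name.ext|...]] or
# [[ファイル:name.ext|...]], and the name ends at the first "|" or "]".
mediafile = re.compile(r'(?:File|ファイル|file):([^|\]]+\.[a-zA-Z0-9]+)')

# Hypothetical line containing two media references.
line = ('[[ファイル:Flag.svg|thumb|国旗]] text '
        '[[File:Map of UK.png|200px]]')

# With exactly one capturing group, findall returns a list of the captured names.
names = mediafile.findall(line)
print(names)
```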