This article is a sequel to my book *Introduction to Python with 100 Language Processing Knocks*. Here I will explain Chapter 3 of the 100 knocks.
This chapter uses regular expressions. In Python they are handled by the `re` module. To use this module you need to understand not only regular expressions themselves but also its methods and match objects, so it is rather difficult; I don't think it is entry-level anymore. I didn't feel I could write a better commentary than the official tutorial, so please read through that.
However, Python regular expressions are slow, so I try to avoid them as much as possible.
For now, download the file. *This file is distributed under a Creative Commons Attribution-ShareAlike 3.0 Unported license.*
```
$ wget https://nlp100.github.io/data/jawiki-country.json.gz
```
According to the problem statement, each line stores the information of one article in JSON format. JSON is a simple serialization of arrays and dictionaries, and many programming languages support it. Strictly speaking, the format of this file as a whole is called JSONL (JSON Lines). Take a look at the contents with `$ gunzip -c jawiki-country.json.gz | less` or the like (`less` may be able to display it directly).
Python also has a library for handling JSON easily, named `json`. The following example, taken from the documentation, turns a JSON string into a Python object and vice versa. It's very easy.
```python
import json

dic = json.loads('{"bar":["baz", null, 1.0, 2]}')
print(type(dic))
print(dic)
```

```
<class 'dict'>
{'bar': ['baz', None, 1.0, 2]}
```

```python
dumped = json.dumps(dic)
print(type(dumped))
print(dumped)
```

```
<class 'str'>
{"bar": ["baz", null, 1.0, 2]}
```
Since it's hard to tell the two apart at a glance, I also displayed the type name with `type()`. By the way, the `s` in `loads` and `dumps` stands for string, not for the third-person singular.
Read the JSON file of Wikipedia articles and display the article text about the "UK". For problems 21-29, work on the article text extracted here.
The downloaded file is gz-compressed, but I'd rather not expand it on disk. It is better to read it with Python's `gzip` module, or to decompress it to standard output with a Unix command and connect it with a pipe.
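As a sketch of the `gzip`-module route (using a tiny in-memory stand-in for the real file so the snippet is self-contained; the title check matches the one in q20.py):

```python
import gzip
import io
import json

# A tiny in-memory stand-in for jawiki-country.json.gz: one JSON object per line.
raw = b'{"title": "England", "text": "{{redirect|UK}}"}\n'
buf = io.BytesIO(gzip.compress(raw))

# mode='rt' decompresses and decodes transparently, so we can iterate lines.
with gzip.open(buf, mode='rt', encoding='utf-8') as f:
    for line in f:
        article = json.loads(line)
        if article['title'] == 'England':
            print(article['text'])  # -> {{redirect|UK}}
```

With the real file you would pass the path `'jawiki-country.json.gz'` to `gzip.open()` instead of the in-memory buffer.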
Below is an example of the answer.
q20.py
```python
import json
import sys

for line in sys.stdin:
    wiki_dict = json.loads(line)
    if wiki_dict['title'] == 'England':
        print(wiki_dict.get('text'))
```
```
$ gunzip -c jawiki-country.json.gz | python q20.py > uk.txt
$ head -n5 uk.txt
{{redirect|UK}}
{{redirect|United Kingdom|The princes of the Spring and Autumn period|English(Spring and autumn)}}
{{Otheruses|European country|Local cuisine of Nagasaki and Kumamoto prefectures|England}}
{{Basic information country | Abbreviation = United Kingdom
```
Redirect pages and an article with the same name also come out, but that should be no problem.
Extract the line that declares the category name in the article.
Let's look at the Wikipedia markup quick reference and the contents of the actual file and think about it.
It seems enough to extract the lines that start with `'[[Category'`. `str.startswith(prefix)` returns whether or not the string starts with `prefix`.
Below is an example of the answer.
q21.py
```python
import sys

for line in sys.stdin:
    if line.startswith('[[Category'):
        print(line.rstrip())
```
(I remember that the 2015 version had a mixture of lowercase `[[category` lines, but they're gone in the 2020 version...)
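If the lowercase variant ever reappears, `str.startswith()` also accepts a tuple of prefixes, so both cases can be handled in one check (a small sketch of my own, not part of the original answer):

```python
lines = [
    '[[Category:England|*]]\n',
    '[[category:Island country]]\n',  # hypothetical lowercase variant
    'Some body text\n',
]

# startswith() with a tuple returns True if the line starts with ANY of the prefixes.
categories = [l.rstrip() for l in lines if l.startswith(('[[Category', '[[category'))]
print(categories)
```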
Extract the article category names (by name, not line by line).
If you cut corners, it looks like this.
q22.py
```python
import sys

for line in sys.stdin:
    print(line.lstrip("[Category:").rstrip("|*]\n"))
```
```
$ python q21.py < uk.txt | python q22.py
England
Commonwealth of Nations
Commonwealth Kingdom
G8 member countries
European Union Member States|Former
Maritime nation
Existing sovereign country
Island country
A nation / territory established in 1801
```
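A caveat about this shortcut: `str.lstrip()` strips a *set of characters*, not a prefix, so a category name that itself begins with letters from `"[Category:"` (say, "tea") would be partially eaten. On Python 3.9+, `str.removeprefix()` avoids this; a hedged sketch:

```python
def category_name(line: str) -> str:
    # removeprefix() drops the exact prefix (or leaves the string unchanged),
    # unlike lstrip(), which strips any run of the listed characters.
    return line.removeprefix('[[Category:').rstrip('|*]\n')

print(category_name('[[Category:tea]]\n'))        # -> tea
print(category_name('[[Category:England|*]]\n'))  # -> England
```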
Display the section name and its level contained in the article (for example, 1 if "== section name ==").
The idea is to turn `== Section name ==` into `Section name 1`. You can count the occurrences of a substring `sub` in a string with `str.count(sub)`.
Below is an example of an answer that does not use regular expressions.
q23.py
```python
import sys

for line in sys.stdin:
    if line.startswith('=='):
        sec_name = line.strip('= \n')
        level = int(line.count('=') / 2 - 1)
        print(sec_name, level)
```
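For comparison, the same extraction can be done with a regular expression from this chapter's toolbox (a sketch; the pattern and group names are my own, not from the original answer):

```python
import re

# The run of leading '='s fixes the level; the trailing '='s are not captured.
sec_pat = re.compile(r'^(?P<eq>={2,})\s*(?P<name>.+?)\s*=+\s*$')

for line in ['==History==\n', '===Politics of England===\n']:
    m = sec_pat.match(line)
    if m:
        # Two '='s mean level 1, three mean level 2, and so on.
        print(m.group('name'), len(m.group('eq')) - 1)
```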
Extract all the media files referenced from the article.
In the 2020 edition, every media file reference has the form `File:Battle of Waterloo 1815.PNG|`. From here on we use regular expressions, keeping in mind that the trailing `|` must be removed and that there may be more than one reference on a line. Testing regular expressions is easy with online checker tools.
Below is an example of the answer.
q24.py
```python
import re
import sys

pat = re.compile(r'(File:)(?P<filename>.+?)\|')

for line in sys.stdin:
    for match in pat.finditer(line):
        print(match.group('filename'))
```
`.+?\|` means "after as few repetitions of any character as possible, a `|`". `finditer()` is convenient when there can be multiple matches; if there is no match, the `for` loop body simply never runs. The same result can be obtained by passing 2 as the argument of `group()`.
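A quick check that the numbered and the named group really refer to the same capture:

```python
import re

pat = re.compile(r'(File:)(?P<filename>.+?)\|')
# Group 1 is '(File:)', group 2 is the named group 'filename'.
m = pat.search('[[File:Battle of Waterloo 1815.PNG|thumb|...]]')
print(m.group(2))           # -> Battle of Waterloo 1815.PNG
print(m.group('filename'))  # same capture, accessed by name
```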
Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.
It is troublesome to handle the fields that have line breaks in the template.
```
{{Basic information Country
|Abbreviated name=England
|Japanese country name=United Kingdom of Great Britain and Northern Ireland
|Official country name= {{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br/>
*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath}}([[Scottish Gaelic]])<br/>
*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon}}([[Welsh]])<br/>
```
Below is an example of the answer.
q25.py
```python
import json
import sys

def main():
    dic = extract_baseinf(sys.stdin)
    sys.stdout.write(json.dumps(dic, ensure_ascii=False))

def extract_baseinf(fi):
    baseinf = {}
    isbaseinf = False
    for line in fi:
        if isbaseinf:
            if line.startswith('}}'):
                return baseinf
            elif line[0] == '|':
                templis = line.strip('|\n').split('=')
                key = templis[0].rstrip()
                value = "=".join(templis[1:]).lstrip()
                baseinf[key] = value
            else:
                # Continuation line: append to the most recent field.
                value = line.rstrip('\n')
                baseinf[key] += f"\n{value}"
        elif line.startswith('{{Basic information'):
            isbaseinf = True

if __name__ == '__main__':
    main()
```
```
$ python q25.py < uk.txt > uk_baseinf.json
```
Fields that span multiple lines are handled by concatenating the lines. I write the result out to JSON once, because the code in the following problems would otherwise get complicated. At this point the characters will be garbled unless you pass `ensure_ascii=False`.
When performing the processing of problem 25, remove MediaWiki emphasis markup (weak emphasis, emphasis, and strong emphasis) from the template values and convert them to plain text (Reference: [Markup Quick Reference](http://ja.wikipedia.org/wiki/Help:%E6%97%A9%E8%A6%8B%E8%A1%A8)).
If `'` appears 2, 3, or 5 times in a row, delete it. As a regular expression that would look something like `r"'{2,5}.+?'{2,5}"`, but doing it strictly is hard. As usual, doing it without a regular expression looks like this.
q26.py
```python
import json
import sys

def main():
    dic = json.loads(sys.stdin.read())
    dic = remove_emphasis(dic)
    print(json.dumps(dic, ensure_ascii=False, indent=4))

def remove_emphasis(dic):
    for key, value in dic.items():
        for n in (5, 3, 2):
            eliminated = value.split("'" * n)
            div, mod = divmod(len(eliminated), 2)
            if mod == 1 and div > 0:
                value = ''.join(eliminated)
        dic[key] = value
    return dic

if __name__ == '__main__':
    main()
```
The flow is to read the JSON file created in the previous problem from standard input and modify the values of the dictionary object. `dict.items()` returns a view of the dictionary's `(key, value)` pairs. Let's remember it.
If you want to use `'` inside a string literal, you need to escape it or enclose the literal in `"` instead. To repeat the same string, you can multiply it by an integer. I delete `'` with `split()` and check that the number of elements in the returned list is odd, so that irregular runs such as `a''b` are not deleted. The quotient and remainder could be computed with `//` and `%`, but since we also want to leave the value as-is when the quotient is 0, the built-in `divmod()` computes both at the same time.
The conditional expression `A and B` appears here for the first time, but you can see what it does just by reading it. The same goes for `or`. What matters is the evaluation strategy: if `A and B` finds that `A` is false, evaluation ends without evaluating `B`. Therefore it is more efficient to put the operand more likely to be false in `A`. Similarly, `A or B` does not evaluate `B` if `A` turns out to be true, so write the expression more likely to be true in `A`.
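The short-circuit behavior is easy to confirm with a helper that records whether it was actually called (a small sketch of my own):

```python
calls = []

def tracked(name, result):
    # Record that this operand was actually evaluated.
    calls.append(name)
    return result

# 'and' stops at the first falsy operand, so 'b' is never evaluated.
tracked('a', False) and tracked('b', True)
print(calls)  # -> ['a']

calls.clear()
# 'or' stops at the first truthy operand, so 'b' is never evaluated here either.
tracked('a', True) or tracked('b', False)
print(calls)  # -> ['a']
```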
In addition to the processing of problem 26, remove MediaWiki's internal link markup from the template values and convert them to text (Reference: Markup Quick Reference).
Since there are 3 patterns, we will use regular expressions.
q27.py
"""
[[Article title]]
[[Article title|Display character]]
[[Article title#Section name|Display character]]
"""
import json
import re
import sys
from q26 import remove_emphasis
def main():
dic = json.loads(sys.stdin.read())
dic = remove_emphasis(dic)
dic = remove_link(dic)
print(json.dumps(dic, ensure_ascii=False, indent=4))
def remove_link(dic):
pat = re.compile(r"""
\[\[ # [[
([^|]+\|)* #Article title|Not or repeated
([^]]+) #Replace the part that matches the display character pat with this one
\]\] # ]]
""", re.VERBOSE)
for key, value in dic.items():
value = pat.sub(r'\2', value)
dic[key] = value
return dic
if __name__ == '__main__':
main()
After the processing of the previous problem, the flow is to modify the dictionary values once again.
You can write a string literal that spans multiple lines by enclosing it in triple quotes. Furthermore, with `re.VERBOSE`, whitespace, line breaks, and comments inside the regular expression are ignored, but it's still hard to read... `pat.sub(r'\2', value)` means: replace the parts of `value` that match `pat` with `group(2)` of the match object.
In addition to the processing of problem 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.
You could do this with Pandoc and pypandoc... If you insist on regular expressions, you would delete emphasis markup, internal links, file references, external links, `<ref>` tags, `<br />`, `{{0}}`, and so on. I'll just put the regular expressions here...
```python
basic_info = re.compile(r"\|(.+?)\s=\s(.+)")
emphasize = re.compile(r"('+){2,5}(.+?)('+){2,5}")
link_inner = re.compile(r"\[\[(.+?)\]\]")
file_ref = re.compile(r"\[\[File:.+?\|.+?\|(.+?)\]\]")
ref = re.compile(r"<ref((\s.+?)>|(>.+?)</ref>)")
link_website = re.compile(r"\[.+?\]")
lang_template = re.compile(r"{{.+?\|.+?\|(.+?)}}")
br = re.compile(r"<.+?>")
space = re.compile(r"{{0}}")
```
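Assuming these patterns are applied in a fixed order (specific before generic, so file references and `<ref>` tags are handled before plain internal links), a sketch of the cleanup might look like the following. The pattern variants here are my own slight rewrites for self-containedness, not the exact ones listed above:

```python
import re

# Order matters: more specific patterns must run before generic ones
# that would otherwise also match parts of their text.
emphasize = re.compile(r"('{2,5})(.+?)\1")
file_ref = re.compile(r"\[\[File:.+?\|.+?\|(.+?)\]\]")
link_inner = re.compile(r"\[\[(?:[^|\]]+\|)?([^\]]+)\]\]")
ref = re.compile(r"<ref(\s[^>]*?/>|[^>]*>.*?</ref>)", re.DOTALL)
br = re.compile(r"<br\s*/?>")

def clean(value):
    value = emphasize.sub(r'\2', value)   # keep only the emphasized text
    value = file_ref.sub(r'\1', value)    # keep the caption of a file reference
    value = link_inner.sub(r'\1', value)  # keep the display text of a link
    value = ref.sub('', value)            # drop <ref> tags entirely
    value = br.sub('', value)             # drop <br/> tags
    return value

print(clean("'''Royal motto''': [[France|Dieu et mon droit]]<ref>note</ref>"))
# -> Royal motto: Dieu et mon droit
```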
Use the contents of the template to get the URL of the national flag image. (Hint: the MediaWiki API's imageinfo can be called to convert the file reference to a URL.)
It seems you should request `https://commons.wikimedia.org/w/api.php` with various parameters (the file name, etc.). If you google "mediawiki api imageinfo" or the like, the parameters will come up. You can hit the API with the Python standard module `urllib`. In the documentation's Examples of Use there is a session that retrieves a URL with parameters using the GET method; you can do it by looking at that part.
Below is an example of the answer.
q29.py
```python
import json
import re
import sys
from urllib import request, parse

baseinf = json.loads(sys.stdin.read())
url = 'https://commons.wikimedia.org/w/api.php'
params = {'action': 'query', 'prop': 'imageinfo', 'iiprop': 'url',
          'format': 'json', 'titles': f'File:{baseinf["National flag image"]}'}
req = request.Request(f'{url}?{parse.urlencode(params)}')
with request.urlopen(req) as res:
    body = res.read()
# print(body['query']['pages']['347935']['imageinfo'][0]['url'])
print(re.search(r'"url":"(.+?)"', body.decode()).group(1))
```
```
$ python q29.py < uk_baseinf.json
https://upload.wikimedia.org/wikipedia/commons/a/ae/Flag_of_the_United_Kingdom.svg
```
Since the returned JSON is complicated, it is more practical to just search for the URL-looking part. Note that `body` is a byte string (that is what `res.read()` returns), so it won't work unless you decode it.
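Rather than regex-searching the raw bytes, the response can also be decoded and parsed as JSON. A sketch with a trimmed stand-in for the API response (the page-ID key varies per file, so take the first page's value):

```python
import json

# A trimmed stand-in for the API response body (bytes, as urlopen returns).
body = (b'{"query": {"pages": {"347935": {"imageinfo": '
        b'[{"url": "https://upload.wikimedia.org/wikipedia/commons/a/ae/'
        b'Flag_of_the_United_Kingdom.svg"}]}}}}')

data = json.loads(body.decode('utf-8'))
# The page-ID key is not known in advance, so take the first page's value.
page = next(iter(data['query']['pages'].values()))
print(page['imageinfo'][0]['url'])
```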
What came up in this chapter:

- `re`
- `json`
- `str.startswith()`
- `dict.items()`
- `and`, `or` and their evaluation strategy
- `urllib`
Personally, this chapter was painful. Will it feel more like NLP from the next chapter on?