This article is a sequel to my book *Introduction to Python with 100 Language Processing Knocks*. Here I will explain Chapter 3 of the 100 knocks.
This chapter uses regular expressions. In Python they are handled by the `re` module. To use this module you need to understand not only regular expressions themselves but also its methods and match objects, so it is rather difficult; I don't think it is entry-level anymore. I didn't feel I could write a better commentary than the official tutorial, so please read through that.
However, Python regular expressions are slow, so I try to avoid them as much as possible.
For now, download the file. *This file is distributed under a Creative Commons Attribution-ShareAlike 3.0 Unported license.*
```
$ wget https://nlp100.github.io/data/jawiki-country.json.gz
```
According to the problem statement, each line stores the information of one article in JSON format. JSON is a simple serialization of arrays and dictionaries, and many programming languages support it. Strictly speaking, the format of this file as a whole is called JSONL (JSON Lines). Take a look at the contents with `$ gunzip -c jawiki-country.json.gz | less` or the like (`less` may be able to display it directly).
Python also has a library for handling JSON easily, named `json`. The following example, taken from the documentation, turns a JSON string into a Python object and vice versa. It's very easy.
```python
import json

dic = json.loads('{"bar":["baz", null, 1.0, 2]}')
print(type(dic))
print(dic)
```

```
<class 'dict'>
{'bar': ['baz', None, 1.0, 2]}
```

```python
dumped = json.dumps(dic)
print(type(dumped))
print(dumped)
```

```
<class 'str'>
{"bar": ["baz", null, 1.0, 2]}
```
Since it's hard to tell the two apart at a glance, I also displayed the type name with `type()`. By the way, the `s` in `loads` and `dumps` stands for string, not for the third-person singular.
Read the JSON file of Wikipedia articles and display the article text about the "UK". For problems 21-29, work on the article text extracted here.
The downloaded file is gz-compressed, but I'd rather not expand it on disk. It is better to read it with Python's `gzip` module, or to decompress it to standard output with a Unix command and connect it with a pipe.
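As a sketch of the `gzip`-module route (using a tiny in-memory stand-in for the real file so the snippet is self-contained; the title check matches the one in q20.py):

```python
import gzip
import io
import json

# A tiny in-memory stand-in for jawiki-country.json.gz: one JSON object per line.
raw = b'{"title": "England", "text": "{{redirect|UK}}"}\n'
buf = io.BytesIO(gzip.compress(raw))

# mode='rt' decompresses and decodes transparently, so we can iterate lines.
with gzip.open(buf, mode='rt', encoding='utf-8') as f:
    for line in f:
        article = json.loads(line)
        if article['title'] == 'England':
            print(article['text'])  # -> {{redirect|UK}}
```

With the real file you would pass the path `'jawiki-country.json.gz'` to `gzip.open()` instead of the in-memory buffer.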
Below is an example of the answer.
q20.py
```python
import json
import sys

for line in sys.stdin:
    wiki_dict = json.loads(line)
    if wiki_dict['title'] == 'England':
        print(wiki_dict.get('text'))
```
```
$ gunzip -c jawiki-country.json.gz | python q20.py > uk.txt
$ head -n5 uk.txt
{{redirect|UK}}
{{redirect|United Kingdom|The princes of the Spring and Autumn period|English(Spring and autumn)}}
{{Otheruses|European country|Local cuisine of Nagasaki and Kumamoto prefectures|England}}
{{Basic information country | Abbreviation = United Kingdom
```
Redirect pages and an article with the same name also come out, but that should be no problem.
Extract the line that declares the category name in the article.
Let's look at the Wikipedia markup quick reference and the contents of the actual file and think about it.
It seems enough to extract the lines that start with `'[[Category'`. `str.startswith(prefix)` returns whether or not the string starts with `prefix`.
Below is an example of the answer.
q21.py
```python
import sys

for line in sys.stdin:
    if line.startswith('[[Category'):
        print(line.rstrip())
```
(I remember that the 2015 version had a mixture of lowercase `[[category` lines, but they're gone in the 2020 version...)
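If the lowercase variant ever reappears, `str.startswith()` also accepts a tuple of prefixes, so both cases can be handled in one check (a small sketch of my own, not part of the original answer):

```python
lines = [
    '[[Category:England|*]]\n',
    '[[category:Island country]]\n',  # hypothetical lowercase variant
    'Some body text\n',
]

# startswith() with a tuple returns True if the line starts with ANY of the prefixes.
categories = [l.rstrip() for l in lines if l.startswith(('[[Category', '[[category'))]
print(categories)
```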
Extract the article category names (by name, not line by line).
If you cut corners, it looks like this.
q22.py
```python
import sys

for line in sys.stdin:
    print(line.lstrip("[Category:").rstrip("|*]\n"))
```
```
$ python q21.py < uk.txt | python q22.py
England
Commonwealth of Nations
Commonwealth Kingdom
G8 member countries
European Union Member States|Former
Maritime nation
Existing sovereign country
Island country
A nation / territory established in 1801
```
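A caveat about this shortcut: `str.lstrip()` strips a *set of characters*, not a prefix, so a category name that itself begins with letters from `"[Category:"` (say, "tea") would be partially eaten. On Python 3.9+, `str.removeprefix()` avoids this; a hedged sketch:

```python
def category_name(line: str) -> str:
    # removeprefix() drops the exact prefix (or leaves the string unchanged),
    # unlike lstrip(), which strips any run of the listed characters.
    return line.removeprefix('[[Category:').rstrip('|*]\n')

print(category_name('[[Category:tea]]\n'))        # -> tea
print(category_name('[[Category:England|*]]\n'))  # -> England
```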
Display the section name and its level contained in the article (for example, 1 if "== section name ==").
The idea is to turn `== Section name ==` into `Section name 1`. You can count the occurrences of a substring `sub` in a string with `str.count(sub)`.
Below is an example of an answer that does not use regular expressions.
q23.py
```python
import sys

for line in sys.stdin:
    if line.startswith('=='):
        sec_name = line.strip('= \n')
        level = int(line.count('=') / 2 - 1)
        print(sec_name, level)
```
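For comparison, the same extraction can be done with a regular expression from this chapter's toolbox (a sketch; the pattern and group names are my own, not from the original answer):

```python
import re

# The run of leading '='s fixes the level; the trailing '='s are not captured.
sec_pat = re.compile(r'^(?P<eq>={2,})\s*(?P<name>.+?)\s*=+\s*$')

for line in ['==History==\n', '===Politics of England===\n']:
    m = sec_pat.match(line)
    if m:
        # Two '='s mean level 1, three mean level 2, and so on.
        print(m.group('name'), len(m.group('eq')) - 1)
```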
Extract all the media files referenced from the article.
In the 2020 edition, every media file reference has the form `File:Battle of Waterloo 1815.PNG|`. From here on we use regular expressions, keeping in mind that the trailing `|` must be removed and that there may be more than one reference on a line. Testing regular expressions is easy with online checker tools.
Below is an example of the answer.
q24.py
```python
import re
import sys

pat = re.compile(r'(File:)(?P<filename>.+?)\|')

for line in sys.stdin:
    for match in pat.finditer(line):
        print(match.group('filename'))
```
`.+?\|` means "after as few repetitions of any character as possible, a `|`". `finditer()` is convenient when there can be multiple matches; if there is no match, the `for` loop body simply never runs. The same result can be obtained by passing 2 as the argument of `group()`.
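A quick check that the numbered and the named group really refer to the same capture:

```python
import re

pat = re.compile(r'(File:)(?P<filename>.+?)\|')
# Group 1 is '(File:)', group 2 is the named group 'filename'.
m = pat.search('[[File:Battle of Waterloo 1815.PNG|thumb|...]]')
print(m.group(2))           # -> Battle of Waterloo 1815.PNG
print(m.group('filename'))  # same capture, accessed by name
```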
Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.
It is troublesome to handle the fields that have line breaks in the template.
```
{{Basic information Country
|Abbreviated name=England
|Japanese country name=United Kingdom of Great Britain and Northern Ireland
|Official country name= {{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br/>
*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath}}([[Scottish Gaelic]])<br/>
*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon}}([[Welsh]])<br/>
```
Below is an example of the answer.
q25.py
```python
import json
import sys

def main():
    dic = extract_baseinf(sys.stdin)
    sys.stdout.write(json.dumps(dic, ensure_ascii=False))

def extract_baseinf(fi):
    baseinf = {}
    isbaseinf = False
    for line in fi:
        if isbaseinf:
            if line.startswith('}}'):
                return baseinf
            elif line[0] == '|':
                templis = line.strip('|\n').split('=')
                key = templis[0].rstrip()
                value = "=".join(templis[1:]).lstrip()
                baseinf[key] = value
            else:
                # Continuation line: append to the most recent field.
                value = line.rstrip('\n')
                baseinf[key] += f"\n{value}"
        elif line.startswith('{{Basic information'):
            isbaseinf = True

if __name__ == '__main__':
    main()
```
```
$ python q25.py < uk.txt > uk_baseinf.json
```
Fields that span multiple lines are handled by concatenating the lines. I write the result out to JSON once, because the code in the following problems would otherwise get complicated. At this point the characters will be garbled unless you pass `ensure_ascii=False`.
When performing the processing of problem 25, remove MediaWiki emphasis markup (weak emphasis, emphasis, and strong emphasis) from the template values and convert them to plain text (Reference: [Markup Quick Reference](http://ja.wikipedia.org/wiki/Help:%E6%97%A9%E8%A6%8B%E8%A1%A8)).
If `'` appears 2, 3, or 5 times in a row, delete it. As a regular expression that would look something like `r"'{2,5}.+?'{2,5}"`, but doing it strictly is hard. As usual, doing it without a regular expression looks like this.
q26.py
```python
import json
import sys

def main():
    dic = json.loads(sys.stdin.read())
    dic = remove_emphasis(dic)
    print(json.dumps(dic, ensure_ascii=False, indent=4))

def remove_emphasis(dic):
    for key, value in dic.items():
        for n in (5, 3, 2):
            eliminated = value.split("'" * n)
            div, mod = divmod(len(eliminated), 2)
            if mod == 1 and div > 0:
                value = ''.join(eliminated)
        dic[key] = value
    return dic

if __name__ == '__main__':
    main()
```
The flow is to read the JSON file created in the previous problem from standard input and modify the values of the dictionary object. `dict.items()` returns a view of the dictionary's `(key, value)` pairs. Let's remember it.
If you want to use `'` inside a string literal, you need to escape it or enclose the literal in `"` instead. To repeat the same string, you can multiply it by an integer. I delete `'` with `split()` and check that the number of elements in the returned list is odd, so that irregular runs such as `a''b` are not deleted. The quotient and remainder could be computed with `//` and `%`, but since we also want to leave the value as-is when the quotient is 0, the built-in `divmod()` computes both at the same time.
The conditional expression `A and B` appears here for the first time, but you can see what it does just by reading it. The same goes for `or`. What matters is the evaluation strategy: if `A and B` finds that `A` is false, evaluation ends without evaluating `B`. Therefore it is more efficient to put the operand more likely to be false in `A`. Similarly, `A or B` does not evaluate `B` if `A` turns out to be true, so write the expression more likely to be true in `A`.
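The short-circuit behavior is easy to confirm with a helper that records whether it was actually called (a small sketch of my own):

```python
calls = []

def tracked(name, result):
    # Record that this operand was actually evaluated.
    calls.append(name)
    return result

# 'and' stops at the first falsy operand, so 'b' is never evaluated.
tracked('a', False) and tracked('b', True)
print(calls)  # -> ['a']

calls.clear()
# 'or' stops at the first truthy operand, so 'b' is never evaluated here either.
tracked('a', True) or tracked('b', False)
print(calls)  # -> ['a']
```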
In addition to the processing of problem 26, remove MediaWiki's internal link markup from the template values and convert them to text (Reference: Markup Quick Reference).
Since there are 3 patterns, we will use regular expressions.
q27.py
"""
[[Article title]]
[[Article title|Display character]]
[[Article title#Section name|Display character]]
"""
import json
import re
import sys
from q26 import remove_emphasis
def main():
dic = json.loads(sys.stdin.read())
dic = remove_emphasis(dic)
dic = remove_link(dic)
print(json.dumps(dic, ensure_ascii=False, indent=4))
def remove_link(dic):
pat = re.compile(r"""
\[\[ # [[
([^|]+\|)* #Article title|Not or repeated
([^]]+) #Replace the part that matches the display character pat with this one
\]\] # ]]
""", re.VERBOSE)
for key, value in dic.items():
value = pat.sub(r'\2', value)
dic[key] = value
return dic
if __name__ == '__main__':
main()
After the processing of the previous problem, the flow is to modify the dictionary values once again.
You can write a string literal that spans multiple lines by enclosing it in triple quotes. Furthermore, with `re.VERBOSE`, whitespace, line breaks, and comments inside the regular expression are ignored, but it's still hard to read... `pat.sub(r'\2', value)` means: replace the parts of `value` that match `pat` with `group(2)` of the match object.
In addition to the processing of problem 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.
You could do this with Pandoc and pypandoc... If you insist on regular expressions, you would delete emphasis markup, internal links, file references, external links, `<ref>` tags, `<br />`, `{{0}}`, and so on. I'll just put the regular expressions here...
```python
basic_info = re.compile(r"\|(.+?)\s=\s(.+)")
emphasize = re.compile(r"('+){2,5}(.+?)('+){2,5}")
link_inner = re.compile(r"\[\[(.+?)\]\]")
file_ref = re.compile(r"\[\[File:.+?\|.+?\|(.+?)\]\]")
ref = re.compile(r"<ref((\s.+?)>|(>.+?)</ref>)")
link_website = re.compile(r"\[.+?\]")
lang_template = re.compile(r"{{.+?\|.+?\|(.+?)}}")
br = re.compile(r"<.+?>")
space = re.compile(r"{{0}}")
```
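Assuming these patterns are applied in a fixed order (specific before generic, so file references and `<ref>` tags are handled before plain internal links), a sketch of the cleanup might look like the following. The pattern variants here are my own slight rewrites for self-containedness, not the exact ones listed above:

```python
import re

# Order matters: more specific patterns must run before generic ones
# that would otherwise also match parts of their text.
emphasize = re.compile(r"('{2,5})(.+?)\1")
file_ref = re.compile(r"\[\[File:.+?\|.+?\|(.+?)\]\]")
link_inner = re.compile(r"\[\[(?:[^|\]]+\|)?([^\]]+)\]\]")
ref = re.compile(r"<ref(\s[^>]*?/>|[^>]*>.*?</ref>)", re.DOTALL)
br = re.compile(r"<br\s*/?>")

def clean(value):
    value = emphasize.sub(r'\2', value)   # keep only the emphasized text
    value = file_ref.sub(r'\1', value)    # keep the caption of a file reference
    value = link_inner.sub(r'\1', value)  # keep the display text of a link
    value = ref.sub('', value)            # drop <ref> tags entirely
    value = br.sub('', value)             # drop <br/> tags
    return value

print(clean("'''Royal motto''': [[France|Dieu et mon droit]]<ref>note</ref>"))
# -> Royal motto: Dieu et mon droit
```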
Use the contents of the template to get the URL of the national flag image. (Hint: the MediaWiki API's imageinfo can be called to convert the file reference to a URL.)
It seems you should request `https://commons.wikimedia.org/w/api.php` with various parameters (the file name, etc.). If you google "mediawiki api imageinfo" or the like, the parameters will come up. You can hit the API with the Python standard module `urllib`. In the documentation's Examples of Use there is a session that retrieves a URL with parameters using the GET method; you can do it by looking at that part.
Below is an example of the answer.
q29.py
```python
import json
import re
import sys
from urllib import request, parse

baseinf = json.loads(sys.stdin.read())
url = 'https://commons.wikimedia.org/w/api.php'
params = {'action': 'query', 'prop': 'imageinfo', 'iiprop': 'url',
          'format': 'json', 'titles': f'File:{baseinf["National flag image"]}'}
req = request.Request(f'{url}?{parse.urlencode(params)}')
with request.urlopen(req) as res:
    body = res.read()
# print(body['query']['pages']['347935']['imageinfo'][0]['url'])
print(re.search(r'"url":"(.+?)"', body.decode()).group(1))
```
```
$ python q29.py < uk_baseinf.json
https://upload.wikimedia.org/wikipedia/commons/a/ae/Flag_of_the_United_Kingdom.svg
```
Since the returned JSON is complicated, it is more practical to just search for the URL-looking part. Note that `body` is a byte string (that is what `res.read()` returns), so it won't work unless you decode it.
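Rather than regex-searching the raw bytes, the response can also be decoded and parsed as JSON. A sketch with a trimmed stand-in for the API response (the page-ID key varies per file, so take the first page's value):

```python
import json

# A trimmed stand-in for the API response body (bytes, as urlopen returns).
body = (b'{"query": {"pages": {"347935": {"imageinfo": '
        b'[{"url": "https://upload.wikimedia.org/wikipedia/commons/a/ae/'
        b'Flag_of_the_United_Kingdom.svg"}]}}}}')

data = json.loads(body.decode('utf-8'))
# The page-ID key is not known in advance, so take the first page's value.
page = next(iter(data['query']['pages'].values()))
print(page['imageinfo'][0]['url'])
```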
What came up in this chapter:

- `re`
- `json`
- `str.startswith()`
- `dict.items()`
- `and`, `or` and their evaluation strategy
- `urllib`
Personally, this chapter was painful. Will it feel more like NLP from the next chapter on?