I'm working through 100 Language Processing Knock, published on the Tohoku University Inui-Okazaki Laboratory web page as training in natural language processing and Python. I plan to keep notes on the code I implement along the way and on the techniques worth remembering. The code is also published on GitHub.
This is a continuation of Chapter 1, Chapter 2 Part 1, and Chapter 2 Part 2.
Since I skipped this for a while, I ended up writing this article while re-reading code I had written earlier, very much in the spirit of "the me of three days ago is a stranger." My proficiency changed quite a bit in the meantime, so I found myself arguing with my own code as I read it. There has been a gap between updates, but I hope you can still learn something from my missteps.
There is a file jawiki-country.json.gz that contains Wikipedia articles exported in the following format:

- One article is stored per line in JSON format
- In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out as JSON
- The whole file is gzipped

Create a program that performs the following processing.
Read the JSON file of Wikipedia articles and display the article body for "UK". In problems 21-29, run against the article text extracted here.
20.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 20.py

import json

with open("jawiki-country.json") as f:
    article_json = f.readline()
    while article_json:
        article_dict = json.loads(article_json)
        if article_dict["title"] == u"England":
            print(article_dict["text"])
        article_json = f.readline()
The jawiki-country.json.gz used here is 9.9 MB, which is fairly heavy, so I read it line by line with `readline()` and print only the "UK" article (other articles are passed over).
I have a feeling the program would stall for a while if I read everything at once with `readlines()`, and since the later problems only operate on the "UK" article anyway, I implemented it this way.
In this data, each line of the file is written in JSON format. Simply loading the whole file (`json.load()`) does not work well and throws away the advantage of JSON, so I used `json.loads()` on each line to parse it (in this case into a plain dictionary).
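As a minimal illustration (assuming the unzipped jawiki-country.json is in the current directory, as in 20.py), each line parses on its own with `json.loads()`:

```python
import json

# Each line of jawiki-country.json is one independent JSON object (JSON Lines style),
# so the file as a whole is not a single document that json.load() could parse;
# instead, parse one line at a time with json.loads().
with open("jawiki-country.json") as f:
    first_line = f.readline()

article = json.loads(first_line)          # a plain dict
print(type(article), article["title"])    # keys include "title" and "text"
```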
From here on, the work of extracting only the "UK" article continues for a while, so I modularized it as follows.
extract_from_json.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# extract_from_json.py

import json

def extract_from_json(title):
    with open("jawiki-country.json") as f:
        json_data = f.readline()
        while json_data:
            article_dict = json.loads(json_data)
            if article_dict["title"] == title:
                return article_dict["text"]
            else:
                json_data = f.readline()
    return ""
Unlike 20.py, this function returns the article body as a string when you pass a title as its argument (or an empty string if the article doesn't exist).
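A quick usage sketch (the later scripts import it with `from mymodule import extract_from_json`, so I assume the function is reachable under that module name):

```python
from mymodule import extract_from_json

# Returns the full article body for the given title, or "" if the title is not found.
text = extract_from_json(u"England")
print(text[:200])   # peek at the first 200 characters
```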
Extract the lines that declare category names in the article.
21.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 21.py

from mymodule import extract_from_json

lines = extract_from_json(u"England").split("\n")
for line in lines:
    if "Category" in line:
        print(line)

# With Python 3, this also displays fine (although as a list)
# print([line for line in lines if "Category" in line])
This is the regular expression chapter, but this solution doesn't use regular expressions. Well, this way is easier to understand...
So it simply prints only the lines that contain the string "Category".
If you write it as a list comprehension it fits neatly into one line, but in Python 2, just printing a list that contains Unicode strings shows the escaped representation, so it isn't readable as Japanese.
The code also runs on Python 3, and there the output is displayed properly.
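A small illustration of that Python 2 behaviour (the string is just a made-up example):

```python
# Printing a list of Unicode strings in Python 2 shows the repr() with \uXXXX escapes;
# printing a single string is readable in both versions.
lines = [u"日本語のカテゴリ行"]
print(lines)      # Python 2: [u'\u65e5\u672c...'], Python 3: ['日本語のカテゴリ行']
print(lines[0])   # readable in both Python 2 and Python 3
```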
Extract the article category names (by name, not line by line).
22.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 22.py

import re
from mymodule import extract_from_json

lines = extract_from_json(u"England").split("\n")
for line in lines:
    category_line = re.search("^\[\[Category:(.*?)(\|.*)*\]\]$", line)
    if category_line is not None:
        print(category_line.group(1))
First, extract the category lines as in 21., and then extract only the name from each of them with `re.search()`.
`re.search()` returns a `MatchObject` instance if some part of the string given as the second argument matches the regular expression pattern given as the first argument.
Leaving aside what a `MatchObject` looks like inside, you can use `.group()` to get the matched string.
In this case `category_line.group(0)` holds the entire matched string (e.g. `[[Category:UK|*]]`), while `category_line.group(1)` holds the part matched by the first group (e.g. `UK`).
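A minimal sketch of that difference, using one of the category lines shown below:

```python
import re

# group(0) is the whole match, group(1) is the first parenthesized group.
m = re.search(r"^\[\[Category:(.*?)(\|.*)*\]\]$", "[[Category:England|*]]")
if m is not None:
    print(m.group(0))   # -> [[Category:England|*]]
    print(m.group(1))   # -> England
```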
As for the regular expression itself, which is the important part, I leave the details to the official documentation and only follow the concrete application on this page. The category lines to be processed this time are the following (the execution result of 21.py):
Execution result of 21.py
$ python 21.py
[[Category:England|*]]
[[Category:Commonwealth Kingdom|*]]
[[Category:G8 member countries]]
[[Category:European Union member states]]
[[Category:Maritime nation]]
[[Category:Sovereign country]]
[[Category:Island country|Kureito Furiten]]
[[Category:States / Regions Established in 1801]]
Basically the lines have the form `[[Category:category name]]`, but some of them also specify a reading, separated by `|`. So the plan is:

- First, the line starts with `[[Category:`
- Some string (the category name) follows
- In some cases a reading kana separated by `|` follows
- Finally, the line is closed with `]]`

Expressed as a regular expression (I'm not sure whether it is optimal), this becomes `^\[\[Category:(.*?)(\|.*)*\]\]$`.
.
Intention | Actual regular expression | Commentary |
---|---|---|
Starts with `[[Category:` | `^\[\[Category:` | `^` anchors the beginning |
Some string (the category name) comes | `(.*?)` | Shortest match of any string |
In some cases a reading kana separated by `\|` comes | `(\|.*)*` | `(\|.*)*?` may be more appropriate |
Finally closed with `]]` | `\]\]$` | `$` marks the end and may not be strictly necessary |
Display the section names contained in the article together with their levels (for example, level 1 for "== section name ==").
23.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 23.py

import re
from mymodule import extract_from_json

lines = extract_from_json(u"England").split("\n")
for line in lines:
    section_line = re.search("^(=+)\s*(.*?)\s*(=+)$", line)
    if section_line is not None:
        print(section_line.group(2), len(section_line.group(1)) - 1)
The basic structure is the same as 22., but this time the target is the section names (e.g. `== section ==`), so those are picked up instead.
Since the notation fluctuates slightly (`==section==` vs `== section ==`), `\s*` is inserted on both sides so that the difference is absorbed.
The section level corresponds to the number of `=` characters (`== level 1 ==`, `=== level 2 ===`, ...), so it is computed by taking the length of the captured run of `=` and subtracting 1.
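A small check of that calculation with made-up section lines:

```python
import re

# The number of leading "=" characters minus one gives the section level.
for line in ["==History==", "=== Politics ==="]:
    m = re.search(r"^(=+)\s*(.*?)\s*(=+)$", line)
    if m is not None:
        print(m.group(2), len(m.group(1)) - 1)   # -> History 1 / Politics 2
```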
Extract all the media files referenced from the article.
24.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 24.py

import re
from mymodule import extract_from_json

lines = extract_from_json(u"England").split("\n")
for line in lines:
    file_line = re.search(u"(File|ファイル):(.*?)\|", line)
    if file_line is not None:
        print(file_line.group(2))
At first I extracted only the entries that start with `File:` and missed the rest... silly me.
The pattern is a `unicode` literal because Japanese (`ファイル`) appears in the regular expression pattern, which is perfectly allowed in Python regex patterns. You often see raw string patterns like `r"hogehoge"`, but at least here it isn't a must; its main benefit is that the escaping doesn't pile up and become hard to read. Furthermore, if you want to reuse a regular expression pattern repeatedly, compiling it with `re.compile()` is supposed to be more efficient. However, recently used patterns are cached, so there was no need to worry about it this time.
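A minimal sketch of pre-compiling the pattern anyway (the sample line is made up):

```python
import re

# Compile once, reuse many times; only worthwhile when the pattern is applied repeatedly.
file_pattern = re.compile(u"(File|ファイル):(.*?)\|")

for line in [u"[[File:Example.png|thumb|caption]]"]:
    m = file_pattern.search(line)
    if m is not None:
        print(m.group(2))   # -> Example.png
```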
Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.
25.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 25.py

import re
from mymodule import extract_from_json

temp_dict = {}
lines = re.split(r"\n[\|}]", extract_from_json(u"England"))
for line in lines:
    temp_line = re.search("^(.*?)\s=\s(.*)", line, re.S)
    if temp_line is not None:
        temp_dict[temp_line.group(1)] = temp_line.group(2)

for k, v in sorted(temp_dict.items(), key=lambda x: x[1]):
    print(k, v)
~~The fields are included in the form `|field name = field value`, so this is a regular expression that matches that. As mentioned above, if you write `^\|(.*?)\s=\s(.*)`, the first group is the field name and the second group is the field value, and they are stored in the dictionary.~~
Basically each field is stored **on its own line** in the form `|field name = field value`, but the official country name was a little troublesome.
Official country name
|Official country name= {{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br/>
*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath}}([[Scottish Gaelic]])<br/>
*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon}}([[Welsh]])<br/>
*{{lang|ga|Ríocht Aontaithe na Breataine Móire agus Tuaisceart na hÉireann}}([[Irish]])<br/>
*{{lang|kw|An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh}}([[Cornish]])<br/>
*{{lang|sco|Unitit Kinrick o Great Breetain an Northren Ireland}}([[Scots]])<br/>
**{{lang|sco|Claught Kängrick o Docht Brätain an Norlin Airlann}}、{{lang|sco|Unitet Kängdom o Great Brittain an Norlin Airlann}}(Ulster Scots)</ref>
As you can see above, it spans multiple lines, i.e. it contains `\n` characters, so this needs to be handled carefully.
In the end, instead of splitting on `\n` with `split()`, I split on `\n|` or `\n}` with `re.split()` (the `}` comes into play because the very last field is not followed by another `|`). The leading `|` of each field is consumed by the split, so the search pattern no longer expects it, and `re.search()` is run with the `re.S` flag so that `.` also matches the `\n` characters remaining inside a multi-line field. I arrived at this through a fair bit of trial and error.
For the time being I just print the contents with a for loop to check them, and as with the earlier problems Python 3 is recommended; Python 3 really is convenient for this kind of thing...
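A toy illustration of that split, with hypothetical field names, showing that a field spanning multiple lines stays in one chunk:

```python
import re

# Splitting on "\n|" or "\n}" instead of "\n" keeps a multi-line field together.
text = "|name = Example\n|official_name = line one\nline two\n}}"
for chunk in re.split(r"\n[\|}]", text):
    print(repr(chunk))
# -> '|name = Example'
#    'official_name = line one\nline two'
#    '}'
```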
In addition to the processing of 25, remove the MediaWiki emphasis markup (all of weak emphasis, emphasis, and strong emphasis) from the template values and convert them to plain text (Reference: [Markup quick reference](https://ja.wikipedia.org/wiki/Help:早見表)).
26.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 26.py

import re
from mymodule import extract_from_json

temp_dict = {}
lines = re.split(r"\n[\|}]", extract_from_json(u"England"))
for line in lines:
    temp_line = re.search("^(.*?)\s=\s(.*)", line, re.S)
    if temp_line is not None:
        temp_dict[temp_line.group(1)] = re.sub(r"'{2,5}", r"", temp_line.group(2))

# As with 25.py, Python 3 is recommended here too
for k, v in sorted(temp_dict.items(), key=lambda x: x[1]):
    print(k, v)
`re.sub()` is a function that replaces the parts of a string that match a regular expression.
This time it is written so that runs of 2 to 5 consecutive `'` characters are deleted.
Writing `{n,m}` in a regular expression means the preceding element repeated at least n and at most m times.
~~Well, I have a feeling I could simply have removed every `'` this time...~~
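A one-line check of that quantifier on a made-up string:

```python
import re

# Remove 2 to 5 consecutive apostrophes (MediaWiki emphasis) but keep a single one.
print(re.sub(r"'{2,5}", r"", "'''strong''' and it's fine"))   # -> strong and it's fine
```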
In addition to the processing of 26, remove MediaWiki's internal link markup from the template values and convert them to text (Reference: Markup quick reference).
27.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 27.py

import re
from mymodule import extract_from_json

def remove_markup(str):
    str = re.sub(r"'{2,5}", r"", str)
    str = re.sub(r"\[{2}([^|\]]+?\|)*(.+?)\]{2}", r"\2", str)
    return str

temp_dict = {}
lines = extract_from_json(u"England").split("\n")
for line in lines:
    category_line = re.search("^\|(.*?)\s=\s(.*)", line)
    if category_line is not None:
        temp_dict[category_line.group(1)] = remove_markup(category_line.group(2))

for k, v in sorted(temp_dict.items(), key=lambda x: x[0]):
    print(k, v)
I created a function `remove_markup()` that removes the markup.

Line | Removal target |
---|---|
1st line | Emphasis (same as 26) |
2nd line | Internal links |

There are three ways of writing internal links, but all of them follow the rule that the article name starts right after `[[` and ends at some delimiter (`]]`, `|`, or `#`), so I wrote the regular expression based on that.
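A small check of that rule on made-up values:

```python
import re

# Keep only the last segment inside [[...]]: "[[article]]" and "[[article|label]]"
# both reduce to the visible text; plain text is left untouched.
pattern = r"\[{2}([^|\]]+?\|)*(.+?)\]{2}"
for s in ["[[London]]", "[[England|UK]]", "plain text"]:
    print(re.sub(pattern, r"\2", s))
# -> London / UK / plain text
```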
In addition to the processing of 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.
28.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 28.py

import re
from mymodule import extract_from_json

def remove_markup(str):
    str = re.sub(r"'{2,5}", r"", str)
    str = re.sub(r"\[{2}([^|\]]+?\|)*(.+?)\]{2}", r"\2", str)
    str = re.sub(r"\{{2}.+?\|.+?\|(.+?)\}{2}", r"\1 ", str)
    str = re.sub(r"<.*?>", r"", str)
    str = re.sub(r"\[.*?\]", r"", str)
    return str

temp_dict = {}
lines = extract_from_json(u"England").split("\n")
for line in lines:
    temp_line = re.search("^\|(.*?)\s=\s(.*)", line)
    if temp_line is not None:
        temp_dict[temp_line.group(1)] = remove_markup(temp_line.group(2))

for k, v in sorted(temp_dict.items(), key=lambda x: x[0]):
    print(k, v)
I rewrote `remove_markup()` so that, in addition to 27, it removes the following:

Line | Removal target |
---|---|
1st line | Emphasis (same as 26) |
2nd line | Internal links (same as 27) |
3rd line | Language-specified notation (not in the markup chart, though) |
4th line | Comments |
5th line | External links |
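As a quick check, appending something like the following to 28.py (the input string is made up) shows each rule firing in turn:

```python
# Emphasis, an internal link, a {{lang|...}} template, a tag and a bracketed external
# link are all stripped; assumes the remove_markup() defined above is in scope.
sample = "'''bold''' [[England|UK]] {{lang|en|Example}}<ref>note</ref> [http://example.com]"
print(remove_markup(sample))   # -> "bold UK Example note "
```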
Use the contents of the template to get the URL of the national flag image. (Hint: Call imageinfo in the MediaWiki API to convert file references to URLs)
29.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 29.py

import re
import requests
from mymodule import extract_from_json

def json_search(json_data):
    ret_dict = {}
    for k, v in json_data.items():
        if isinstance(v, list):
            for e in v:
                ret_dict.update(json_search(e))
        elif isinstance(v, dict):
            ret_dict.update(json_search(v))
        else:
            ret_dict[k] = v
    return ret_dict

def remove_markup(str):
    str = re.sub(r"'{2,5}", r"", str)
    str = re.sub(r"\[{2}([^|\]]+?\|)*(.+?)\]{2}", r"\2", str)
    str = re.sub(r"\{{2}.+?\|.+?\|(.+?)\}{2}", r"\1 ", str)
    str = re.sub(r"<.*?>", r"", str)
    str = re.sub(r"\[.*?\]", r"", str)
    return str

temp_dict = {}
lines = extract_from_json(u"England").split("\n")
for line in lines:
    temp_line = re.search("^\|(.*?)\s=\s(.*)", line)
    if temp_line is not None:
        temp_dict[temp_line.group(1)] = remove_markup(temp_line.group(2))

url = "https://en.wikipedia.org/w/api.php"
payload = {"action": "query",
           "titles": "File:{}".format(temp_dict[u"National flag image"]),
           "prop": "imageinfo",
           "format": "json",
           "iiprop": "url"}

json_data = requests.get(url, params=payload).json()
print(json_search(json_data)["url"])
How do you hit an API from Python? When I looked into it, this area turned out to be surprisingly convoluted: `requests` was developed because `urllib`/`urllib2` are hard to use, and `requests` is itself built on `urllib3`... Well, the bottom line seems to be that `requests` is the recommended choice.
The official Python 3 documentation says:
> For a higher level HTTP client interface, the Requests package is recommended.
And the official Requests documentation says:
> Requests: HTTP for Humans (omitted) Python's standard urllib2 module has most of the required HTTP functionality, but the API doesn't work properly.
So Requests is recommended in rather strong terms.
For details on how to use it, see the official documentation; this time I received the API response as JSON and processed it. The structure of the returned JSON was complicated, so I recursively searched the whole thing and pulled out the part where the URL is written.
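A toy illustration of what `json_search()` does, on a made-up nested structure (the real API response is shaped differently, but the idea is the same; assumes the function from 29.py is in scope):

```python
# json_search() flattens nested dicts and lists of dicts into one dict of leaf
# key/value pairs, so "url" can be pulled out without knowing the exact nesting.
sample = {"query": {"pages": [{"imageinfo": [{"url": "https://example.org/flag.png"}]}]}}
print(json_search(sample)["url"])   # -> https://example.org/flag.png
```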
Continued in Chapter 4.