I'm working through 100 Language Processing Knock, published on the Tohoku University Inui-Okazaki Laboratory web page as training in natural language processing and Python. I plan to keep notes on the code I implement along the way and on the techniques worth remembering. The code is also published on GitHub.
This is a continuation of Chapter 1, Chapter 2 Part 1, and Chapter 2 Part 2.
Since I skipped this for a while, I ended up writing this article while re-reading code I had written earlier, very much in the spirit of "the me of three days ago is a stranger." My proficiency changed quite a bit in the meantime, so I found myself arguing with my own code as I read it. There has been a gap between updates, but I hope you can still learn something from my missteps.
There is a file jawiki-country.json.gz that contains Wikipedia articles exported in the following format:

- One article is stored per line in JSON format
- In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out as JSON
- The whole file is gzipped

Create a program that performs the following processing.
Read the JSON file of Wikipedia articles and display the article body for "UK". In problems 21-29, run against the article text extracted here.
20.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 20.py

import json

with open("jawiki-country.json") as f:
    article_json = f.readline()
    while article_json:
        article_dict = json.loads(article_json)
        if article_dict["title"] == u"England":
            print(article_dict["text"])
        article_json = f.readline()
The jawiki-country.json.gz used here is 9.9 MB, which is fairly heavy, so I read it line by line with `readline()` and print only the "UK" article (other articles are passed over).
I have a feeling the program would stall for a while if I read everything at once with `readlines()`, and since the later problems only operate on the "UK" article anyway, I implemented it this way.
In this data, each line of the file is written in JSON format. Simply loading the whole file (`json.load()`) does not work well and throws away the advantage of JSON, so I used `json.loads()` on each line to parse it (in this case into a plain dictionary).
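As a minimal illustration (assuming the unzipped jawiki-country.json is in the current directory, as in 20.py), each line parses on its own with `json.loads()`:

```python
import json

# Each line of jawiki-country.json is one independent JSON object (JSON Lines style),
# so the file as a whole is not a single document that json.load() could parse;
# instead, parse one line at a time with json.loads().
with open("jawiki-country.json") as f:
    first_line = f.readline()

article = json.loads(first_line)          # a plain dict
print(type(article), article["title"])    # keys include "title" and "text"
```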
From here on, the work of extracting only the "UK" article continues for a while, so I modularized it as follows.
extract_from_json.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# extract_from_json.py

import json

def extract_from_json(title):
    with open("jawiki-country.json") as f:
        json_data = f.readline()
        while json_data:
            article_dict = json.loads(json_data)
            if article_dict["title"] == title:
                return article_dict["text"]
            else:
                json_data = f.readline()
    return ""
Unlike 20.py, this function returns the article body as a string when you pass a title as its argument (or an empty string if the article doesn't exist).
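A quick usage sketch (the later scripts import it with `from mymodule import extract_from_json`, so I assume the function is reachable under that module name):

```python
from mymodule import extract_from_json

# Returns the full article body for the given title, or "" if the title is not found.
text = extract_from_json(u"England")
print(text[:200])   # peek at the first 200 characters
```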
Extract the lines that declare category names in the article.
21.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 21.py

from mymodule import extract_from_json

lines = extract_from_json(u"England").split("\n")
for line in lines:
    if "Category" in line:
        print(line)

# With Python 3, this also displays fine (although as a list)
# print([line for line in lines if "Category" in line])
This is the regular expression chapter, but this solution doesn't use regular expressions. Well, this way is easier to understand...
So it simply prints only the lines that contain the string "Category".
If you write it as a list comprehension it fits neatly into one line, but in Python 2, just printing a list that contains Unicode strings shows the escaped representation, so it isn't readable as Japanese.
The code also runs on Python 3, and there the output is displayed properly.
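A small illustration of that Python 2 behaviour (the string is just a made-up example):

```python
# Printing a list of Unicode strings in Python 2 shows the repr() with \uXXXX escapes;
# printing a single string is readable in both versions.
lines = [u"日本語のカテゴリ行"]
print(lines)      # Python 2: [u'\u65e5\u672c...'], Python 3: ['日本語のカテゴリ行']
print(lines[0])   # readable in both Python 2 and Python 3
```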
Extract the article category names (by name, not line by line).
22.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 22.py

import re
from mymodule import extract_from_json

lines = extract_from_json(u"England").split("\n")
for line in lines:
    category_line = re.search("^\[\[Category:(.*?)(\|.*)*\]\]$", line)
    if category_line is not None:
        print(category_line.group(1))
First, extract the category lines as in 21., and then extract only the name from each of them with `re.search()`.
`re.search()` returns a `MatchObject` instance if some part of the string given as the second argument matches the regular expression pattern given as the first argument.
Leaving aside what a `MatchObject` looks like inside, you can use `.group()` to get the matched string.
In this case `category_line.group(0)` holds the entire matched string (e.g. `[[Category:UK|*]]`), while `category_line.group(1)` holds the part matched by the first group (e.g. `UK`).
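A minimal sketch of that difference, using one of the category lines shown below:

```python
import re

# group(0) is the whole match, group(1) is the first parenthesized group.
m = re.search(r"^\[\[Category:(.*?)(\|.*)*\]\]$", "[[Category:England|*]]")
if m is not None:
    print(m.group(0))   # -> [[Category:England|*]]
    print(m.group(1))   # -> England
```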
As for the regular expression itself, which is the important part, I leave the details to the official documentation and only follow the concrete application on this page. The category lines to be processed this time are the following (the execution result of 21.py):
Execution result of 21.py
$ python 21.py
[[Category:England|*]]
[[Category:Commonwealth Kingdom|*]]
[[Category:G8 member countries]]
[[Category:European Union member states]]
[[Category:Maritime nation]]
[[Category:Sovereign country]]
[[Category:Island country|Kureito Furiten]]
[[Category:States / Regions Established in 1801]]
Basically the lines have the form `[[Category:category name]]`, but some of them also specify a reading, separated by `|`. So the plan is:

- First, the line starts with `[[Category:`
- Some string (the category name) follows
- In some cases a reading kana separated by `|` follows
- Finally, the line is closed with `]]`

Expressed as a regular expression (I'm not sure whether it is optimal), this becomes `^\[\[Category:(.*?)(\|.*)*\]\]$`.
.
Intention | Actual regular expression | Commentary |
---|---|---|
Starts with `[[Category:` | `^\[\[Category:` | `^` anchors the beginning |
Some string (the category name) comes | `(.*?)` | Shortest match of any string |
In some cases a reading kana separated by `\|` comes | `(\|.*)*` | `(\|.*)*?` may be more appropriate |
Finally closed with `]]` | `\]\]$` | `$` marks the end and may not be strictly necessary |
Display the section names contained in the article together with their levels (for example, level 1 for "== section name ==").
23.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 23.py

import re
from mymodule import extract_from_json

lines = extract_from_json(u"England").split("\n")
for line in lines:
    section_line = re.search("^(=+)\s*(.*?)\s*(=+)$", line)
    if section_line is not None:
        print(section_line.group(2), len(section_line.group(1)) - 1)
The basic structure is the same as 22., but this time the target is the section names (e.g. `== section ==`), so those are picked up instead.
Since the notation fluctuates slightly (`==section==` vs `== section ==`), `\s*` is inserted on both sides so that the difference is absorbed.
The section level corresponds to the number of `=` characters (`== level 1 ==`, `=== level 2 ===`, ...), so it is computed by taking the length of the captured run of `=` and subtracting 1.
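A small check of that calculation with made-up section lines:

```python
import re

# The number of leading "=" characters minus one gives the section level.
for line in ["==History==", "=== Politics ==="]:
    m = re.search(r"^(=+)\s*(.*?)\s*(=+)$", line)
    if m is not None:
        print(m.group(2), len(m.group(1)) - 1)   # -> History 1 / Politics 2
```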
Extract all the media files referenced from the article.
24.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 24.py

import re
from mymodule import extract_from_json

lines = extract_from_json(u"England").split("\n")
for line in lines:
    file_line = re.search(u"(File|ファイル):(.*?)\|", line)
    if file_line is not None:
        print(file_line.group(2))
At first I extracted only the entries that start with `File:` and missed the rest... silly me.
The pattern is a `unicode` literal because Japanese (`ファイル`) appears in the regular expression pattern, which is perfectly allowed in Python regex patterns. You often see raw string patterns like `r"hogehoge"`, but at least here it isn't a must; its main benefit is that the escaping doesn't pile up and become hard to read. Furthermore, if you want to reuse a regular expression pattern repeatedly, compiling it with `re.compile()` is supposed to be more efficient. However, recently used patterns are cached, so there was no need to worry about it this time.
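A minimal sketch of pre-compiling the pattern anyway (the sample line is made up):

```python
import re

# Compile once, reuse many times; only worthwhile when the pattern is applied repeatedly.
file_pattern = re.compile(u"(File|ファイル):(.*?)\|")

for line in [u"[[File:Example.png|thumb|caption]]"]:
    m = file_pattern.search(line)
    if m is not None:
        print(m.group(2))   # -> Example.png
```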
Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.
25.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 25.py

import re
from mymodule import extract_from_json

temp_dict = {}
lines = re.split(r"\n[\|}]", extract_from_json(u"England"))
for line in lines:
    temp_line = re.search("^(.*?)\s=\s(.*)", line, re.S)
    if temp_line is not None:
        temp_dict[temp_line.group(1)] = temp_line.group(2)

for k, v in sorted(temp_dict.items(), key=lambda x: x[1]):
    print(k, v)
~~The fields are included in the form `|field name = field value`, so this is a regular expression that matches that. As mentioned above, if you write `^\|(.*?)\s=\s(.*)`, the first group is the field name and the second group is the field value, and they are stored in the dictionary.~~
Basically each field is stored **on its own line** in the form `|field name = field value`, but the official country name was a little troublesome.
Official country name
|Official country name= {{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br/>
*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath}}([[Scottish Gaelic]])<br/>
*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon}}([[Welsh]])<br/>
*{{lang|ga|Ríocht Aontaithe na Breataine Móire agus Tuaisceart na hÉireann}}([[Irish]])<br/>
*{{lang|kw|An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh}}([[Cornish]])<br/>
*{{lang|sco|Unitit Kinrick o Great Breetain an Northren Ireland}}([[Scots]])<br/>
**{{lang|sco|Claught Kängrick o Docht Brätain an Norlin Airlann}}、{{lang|sco|Unitet Kängdom o Great Brittain an Norlin Airlann}}(Ulster Scots)</ref>
As you can see above, it spans multiple lines, i.e. it contains `\n` characters, so this needs to be handled carefully.
In the end, instead of splitting on `\n` with `split()`, I split on `\n|` or `\n}` with `re.split()` (the `}` comes into play because the very last field is not followed by another `|`). The leading `|` of each field is consumed by the split, so the search pattern no longer expects it, and `re.search()` is run with the `re.S` flag so that `.` also matches the `\n` characters remaining inside a multi-line field. I arrived at this through a fair bit of trial and error.
For the time being I just print the contents with a for loop to check them, and as with the earlier problems Python 3 is recommended; Python 3 really is convenient for this kind of thing...
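A toy illustration of that split, with hypothetical field names, showing that a field spanning multiple lines stays in one chunk:

```python
import re

# Splitting on "\n|" or "\n}" instead of "\n" keeps a multi-line field together.
text = "|name = Example\n|official_name = line one\nline two\n}}"
for chunk in re.split(r"\n[\|}]", text):
    print(repr(chunk))
# -> '|name = Example'
#    'official_name = line one\nline two'
#    '}'
```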
In addition to the processing of 25, remove the MediaWiki emphasis markup (all of weak emphasis, emphasis, and strong emphasis) from the template values and convert them to plain text (Reference: [Markup quick reference](https://ja.wikipedia.org/wiki/Help:早見表)).
26.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 26.py

import re
from mymodule import extract_from_json

temp_dict = {}
lines = re.split(r"\n[\|}]", extract_from_json(u"England"))
for line in lines:
    temp_line = re.search("^(.*?)\s=\s(.*)", line, re.S)
    if temp_line is not None:
        temp_dict[temp_line.group(1)] = re.sub(r"'{2,5}", r"", temp_line.group(2))

# As with 25.py, Python 3 is recommended here too
for k, v in sorted(temp_dict.items(), key=lambda x: x[1]):
    print(k, v)
`re.sub()` is a function that replaces the parts of a string that match a regular expression.
This time it is written so that runs of 2 to 5 consecutive `'` characters are deleted.
Writing `{n,m}` in a regular expression means the preceding element repeated at least n and at most m times.
~~Well, I have a feeling I could simply have removed every `'` this time...~~
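A one-line check of that quantifier on a made-up string:

```python
import re

# Remove 2 to 5 consecutive apostrophes (MediaWiki emphasis) but keep a single one.
print(re.sub(r"'{2,5}", r"", "'''strong''' and it's fine"))   # -> strong and it's fine
```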
In addition to the processing of 26, remove MediaWiki's internal link markup from the template values and convert them to text (Reference: Markup quick reference).
27.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 27.py

import re
from mymodule import extract_from_json

def remove_markup(str):
    str = re.sub(r"'{2,5}", r"", str)
    str = re.sub(r"\[{2}([^|\]]+?\|)*(.+?)\]{2}", r"\2", str)
    return str

temp_dict = {}
lines = extract_from_json(u"England").split("\n")
for line in lines:
    category_line = re.search("^\|(.*?)\s=\s(.*)", line)
    if category_line is not None:
        temp_dict[category_line.group(1)] = remove_markup(category_line.group(2))

for k, v in sorted(temp_dict.items(), key=lambda x: x[0]):
    print(k, v)
I created a function `remove_markup()` that removes the markup.

Line | Removal target |
---|---|
1st line | Emphasis (same as 26) |
2nd line | Internal links |

There are three ways of writing internal links, but all of them follow the rule that the article name starts right after `[[` and ends at some delimiter (`]]`, `|`, or `#`), so I wrote the regular expression based on that.
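A small check of that rule on made-up values:

```python
import re

# Keep only the last segment inside [[...]]: "[[article]]" and "[[article|label]]"
# both reduce to the visible text; plain text is left untouched.
pattern = r"\[{2}([^|\]]+?\|)*(.+?)\]{2}"
for s in ["[[London]]", "[[England|UK]]", "plain text"]:
    print(re.sub(pattern, r"\2", s))
# -> London / UK / plain text
```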
In addition to the processing of 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.
28.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 28.py

import re
from mymodule import extract_from_json

def remove_markup(str):
    str = re.sub(r"'{2,5}", r"", str)
    str = re.sub(r"\[{2}([^|\]]+?\|)*(.+?)\]{2}", r"\2", str)
    str = re.sub(r"\{{2}.+?\|.+?\|(.+?)\}{2}", r"\1 ", str)
    str = re.sub(r"<.*?>", r"", str)
    str = re.sub(r"\[.*?\]", r"", str)
    return str

temp_dict = {}
lines = extract_from_json(u"England").split("\n")
for line in lines:
    temp_line = re.search("^\|(.*?)\s=\s(.*)", line)
    if temp_line is not None:
        temp_dict[temp_line.group(1)] = remove_markup(temp_line.group(2))

for k, v in sorted(temp_dict.items(), key=lambda x: x[0]):
    print(k, v)
I rewrote `remove_markup()` so that, in addition to 27, it removes the following:

Line | Removal target |
---|---|
1st line | Emphasis (same as 26) |
2nd line | Internal links (same as 27) |
3rd line | Language-specified notation (not in the markup chart, though) |
4th line | Comments |
5th line | External links |
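As a quick check, appending something like the following to 28.py (the input string is made up) shows each rule firing in turn:

```python
# Emphasis, an internal link, a {{lang|...}} template, a tag and a bracketed external
# link are all stripped; assumes the remove_markup() defined above is in scope.
sample = "'''bold''' [[England|UK]] {{lang|en|Example}}<ref>note</ref> [http://example.com]"
print(remove_markup(sample))   # -> "bold UK Example note "
```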
Use the contents of the template to get the URL of the national flag image. (Hint: Call imageinfo in the MediaWiki API to convert file references to URLs)
29.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 29.py

import re
import requests
from mymodule import extract_from_json

def json_search(json_data):
    ret_dict = {}
    for k, v in json_data.items():
        if isinstance(v, list):
            for e in v:
                ret_dict.update(json_search(e))
        elif isinstance(v, dict):
            ret_dict.update(json_search(v))
        else:
            ret_dict[k] = v
    return ret_dict

def remove_markup(str):
    str = re.sub(r"'{2,5}", r"", str)
    str = re.sub(r"\[{2}([^|\]]+?\|)*(.+?)\]{2}", r"\2", str)
    str = re.sub(r"\{{2}.+?\|.+?\|(.+?)\}{2}", r"\1 ", str)
    str = re.sub(r"<.*?>", r"", str)
    str = re.sub(r"\[.*?\]", r"", str)
    return str

temp_dict = {}
lines = extract_from_json(u"England").split("\n")
for line in lines:
    temp_line = re.search("^\|(.*?)\s=\s(.*)", line)
    if temp_line is not None:
        temp_dict[temp_line.group(1)] = remove_markup(temp_line.group(2))

url = "https://en.wikipedia.org/w/api.php"
payload = {"action": "query",
           "titles": "File:{}".format(temp_dict[u"National flag image"]),
           "prop": "imageinfo",
           "format": "json",
           "iiprop": "url"}

json_data = requests.get(url, params=payload).json()
print(json_search(json_data)["url"])
How do you hit an API from Python? When I looked into it, this area turned out to be surprisingly convoluted: `requests` was developed because `urllib`/`urllib2` are hard to use, and `requests` is itself built on `urllib3`... Well, the bottom line seems to be that `requests` is the recommended choice.
The official Python 3 documentation says:
> For a higher level HTTP client interface, the Requests package is recommended.
And the official Requests documentation says:
> Requests: HTTP for Humans (omitted) Python's standard urllib2 module has most of the required HTTP functionality, but the API doesn't work properly.
So Requests is recommended in rather strong terms.
For details on how to use it, see the official documentation; this time I received the API response as JSON and processed it. The structure of the returned JSON was complicated, so I recursively searched the whole thing and pulled out the part where the URL is written.
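A toy illustration of what `json_search()` does, on a made-up nested structure (the real API response is shaped differently, but the idea is the same; assumes the function from 29.py is in scope):

```python
# json_search() flattens nested dicts and lists of dicts into one dict of leaf
# key/value pairs, so "url" can be pulled out without knowing the exact nesting.
sample = {"query": {"pages": [{"imageinfo": [{"url": "https://example.org/flag.png"}]}]}}
print(json_search(sample)["url"])   # -> https://example.org/flag.png
```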
Continued in Chapter 4.