I tried 100 language processing knock 2020: Chapter 3

Introduction

I tried the 100 Language Processing Knock 2020. Links to the other chapters are available from here, and the source code is available from here.

Chapter 3 Regular Expressions

No.20 Reading JSON data

Read the JSON file of Wikipedia articles and display the text of the article about the "UK". Problems 21-29 operate on the article text extracted here.

Answer

020.py


import pandas as pd

path = "jawiki-country.json.gz"
df = pd.read_json(path, lines=True)
print(df.query("title == 'England'")["text"].values[0])

# -> {{redirect|UK}}
#    {{redirect|United Kingdom|The princes of the Spring and Autumn period|English(Spring and autumn)}}
#    {{Otheruses|European country|Local cuisine of Nagasaki and Kumamoto prefectures|England}}

Comments

Since the given file is in JSON Lines format, I set lines=True. query() returns the rows that satisfy a condition. You could also write df[df["title"] == "England"].
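
To make the JSON Lines point concrete, here is a minimal sketch that uses only the standard library instead of pandas. Each line of such a file is one complete JSON object, so it can be parsed line by line. The helper name find_article_text and the two sample records are made up for illustration and are not part of the original answer.

```python
import json


def find_article_text(lines, title):
    """Return the "text" field of the first record whose "title" matches.

    `lines` is any iterable of strings, each holding one JSON object
    (the JSON Lines layout). Returns None if no record matches.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("title") == title:
            return record.get("text")
    return None


# Two made-up records standing in for the real file.
sample = [
    '{"title": "Egypt", "text": "{{Otheruses|country}}"}',
    '{"title": "England", "text": "{{redirect|UK}}"}',
]
print(find_article_text(sample, "England"))
```

With the real file, passing a file object opened with gzip.open(path, "rt", encoding="utf-8") should work the same way, because a text-mode file object yields one line at a time.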

No.21 Extract the line containing the category name

Extract the line that declares the category name in the article.

Answer

021.py


import input_json as js

article = js.input()
print(list(filter(lambda x: "[[Category" in x, article)))

# -> ['[[Category:England|*]]', '[[Category:England連邦加盟国]]', '[[Category:Commonwealth Kingdom|*]]'...

Comments

The text extracted in No.20 can now be loaded with import and js.input(). The filter function returns an iterator, right? Wrapping it as list(filter()) feels clumsy to me. I wonder if there is another way to implement this.
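
One common alternative to list(filter()) is a list comprehension. The sketch below uses a few made-up lines in place of the article returned by js.input(), so it runs on its own.

```python
# Made-up lines standing in for the article returned by js.input().
article = [
    "== History ==",
    "[[Category:England|*]]",
    "Some ordinary sentence.",
    "[[Category:Island country]]",
]

# Same result as list(filter(lambda x: "[[Category" in x, article)).
category_lines = [line for line in article if "[[Category" in line]
print(category_lines)
```

Both forms give the same list; the comprehension just avoids the lambda and the extra list() call.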

No.22 Extraction of category name

Extract the category names of the article (as names, not as whole lines).

Answer

022.py



import input_json as js
import re

article = js.input()
# Capture the category name: the text after "[[Category:" up to "|", "," or "]".
matches = [re.match(r"\[\[Category:([^|,\]]+)", line) for line in article]
print([m.groups()[0] for m in matches if m is not None])

# -> ['England', 'England連邦加盟国', 'Commonwealth Kingdom'...

Comments

I noticed that this chapter is titled Regular Expressions, so I used one here. The pattern extracts the string between [[Category: and the next |, comma, or ]. It feels like deciphering a code, which is fun. groups() returns the contents of the capture groups () in the regular expression as a tuple, so I collect the first element of each into a list.
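
As a small illustration of how match() and groups() behave, here is a sketch on three made-up lines. The pattern is the one from 022.py; the sample lines are assumptions.

```python
import re

# Pattern from 022.py, written as a raw string.
pattern = re.compile(r"\[\[Category:([^|,\]]+)")

lines = [
    "[[Category:England|*]]",
    "[[Category:Island country]]",
    "No category here",
]

names = []
for line in lines:
    m = pattern.match(line)  # match() only succeeds at the start of the line
    if m is not None:
        names.append(m.groups()[0])  # groups() is a tuple of the captured groups

print(names)
```

Lines that do not start with [[Category: give None, which is why the None check is needed before calling groups().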

No.23 Section structure

Display the section names included in the article together with their levels (for example, 1 for "== section name ==").

Answer

023.py


import input_json as js
import re

article = js.input()
# Group 1 is the run of "=" signs, group 2 is the section name.
matches = [re.search(r"(==+)([^=]*)", line) for line in article]
print([{m.group(2): f'Level {m.group(1).count("=")}'} for m in matches if m is not None])

# -> [{'Country name': 'Level 2'}, {'history': 'Level 2'}...

Comments

I retrieved the parts matched by ==+ and [^=]* with search().group(). The section level seemed to correspond to the number of = signs, so I counted them with count() and stored each result in a dict.
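
The problem statement gives level 1 for "== section name ==", which would mean the level is the number of = signs minus one. The following is a hedged sketch of that reading, not the answer above: the function name section_level and the stricter pattern (which requires the same run of = on both sides) are my own assumptions.

```python
import re

# Require the same run of "=" on both sides of the section name.
heading = re.compile(r"^(={2,})\s*(.+?)\s*\1\s*$")


def section_level(line):
    """Return (name, level) for a heading line, or None for other lines."""
    m = heading.match(line)
    if m is None:
        return None
    # "==" is level 1, "===" is level 2, and so on.
    return m.group(2), len(m.group(1)) - 1


print(section_level("== History =="))
print(section_level("===Geography==="))
print(section_level("plain text with a=b"))
```

Anchoring the pattern to the whole line also keeps ordinary lines that merely contain "==" somewhere from being reported as headings.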

No.24 Extraction of file reference

Extract all media files referenced from the article.

Answer

024.py


import input_json as js
import re

article = js.input()
# Capture the file name: the text after "[[File:" at the start of a line, up to "|".
matches = [re.search(r"^\[\[File:([^|]*)", line) for line in article]
print([m.group(1) for m in matches if m is not None])

# -> ['Descriptio Prime Tabulae Europae.jpg', "Lenepveu, Jeanne d'Arc au siège d'Orléans.jpg "...

Comments

Judging from the markup quick reference, the parts that start with [[File: appear to be the media files. I also searched for lines that contain [[File: somewhere other than at the start, but I excluded them because they did not seem to be what the problem asks for.
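
If one did want to include references that appear in the middle of a line, a sketch could look like the following. Whether those references should count is a judgment call, and the sample text is made up.

```python
import re

# Not anchored with ^, so references in the middle of a line are found too.
file_ref = re.compile(r"\[\[File:([^|\]]+)")

text = "\n".join([
    "[[File:Map of Europe.jpg|thumb|A map]]",
    "|Flag image = [[File:Flag.svg]] shown in the infobox",
    "No file on this line",
])

print(file_ref.findall(text))
```

Because the pattern has a single capture group, findall() returns just the captured file names.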

No.25 Template extraction

Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.

Answer

025.py


import pandas as pd
import re

df = pd.read_json("jawiki-country.json.gz", lines=True)
article = df.query("title == 'England'")["text"].values[0]
# Take everything between "{{Basic information Country\n|" and "\n}}".
body = re.findall(r"(?<=Basic information Country\n\|).*(?=\n\}\})", article, flags=re.DOTALL)
ans = {}
for item in body[0].replace(" ", "").split("\n|"):
    # Drop <ref>...</ref> footnotes, then split into field name and value.
    kv = re.sub(r"<ref>.*</ref>", "", item, flags=re.DOTALL).split("=")
    ans[kv[0]] = kv[1]
print(ans)

# -> {'Abbreviated name': 'England', 'Japanese country name': 'United Kingdom of Great Britain and Northern Ireland',...

Comments

The part that starts with {{Basic information and ends with \n}} is matched using a positive lookbehind and a positive lookahead, and the result is then split on \n|. The re.DOTALL flag is required so that . also matches \n. This problem took quite a while ...
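
Here is a minimal sketch of the lookbehind and lookahead idea on a made-up miniature template. Two details are my own choices rather than the original code: the non-greedy .*? so that the match stops at the first \n}}, and split("=", 1) so that a value containing = is not cut short.

```python
import re

# Made-up miniature of an infobox template.
article = (
    "{{redirect|UK}}\n"
    "{{Basic information Country\n"
    "|Abbreviated name = England\n"
    "|Capital = London\n"
    "}}\n"
    "Body text follows.\n"
)

# (?<=...) must precede the match and (?=...) must follow it; neither is
# included in the result. DOTALL lets "." cross line boundaries.
body = re.search(
    r"(?<=\{\{Basic information Country\n\|).*?(?=\n\}\})",
    article,
    flags=re.DOTALL,
).group()

fields = {}
for item in body.split("\n|"):
    key, value = item.split("=", 1)  # split on the first "=" only
    fields[key.strip()] = value.strip()

print(fields)
```

Note that Python's re module only accepts fixed-width lookbehind patterns, which is fine here because the template header is a literal string.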

No.26 Removal of emphasis markup

During the processing in No.25, remove MediaWiki's emphasis markup (all of weak emphasis, emphasis, and strong emphasis) from the template values and convert them to text (reference: markup quick reference table).

Answer

026.py


import pandas as pd
import re


def remove_quote(a: list):
    ans = {}
    for i in a:
        i = re.sub("'+", "", i, flags=re.DOTALL)
        i = re.sub("<br/>", "", i, flags=re.DOTALL).split("=")
        ans[i[0]]= i[1]
    return ans


df = pd.read_json("jawiki-country.json.gz", lines=True)
article = df.query("title == 'England'")["text"].values[0]
body = re.findall(r"(?<=Basic information Country\n\|).*(?=\n\}\})", article, flags=re.DOTALL)
# Drop <ref>...</ref> footnotes from each "name=value" field.
fields = [re.sub(r"<ref>.*</ref>", "", item, flags=re.DOTALL) for item in body[0].replace(" ", "").split("\n|")]
print(remove_quote(fields))

# -> ...'Motto': '{{lang|fr|[[Dieuetmondroit]]}}([[French]]:[[Dieuetmondroit|God and my rights]])',...

Comments

I removed ' and <br /> from the output of No.25. It turns out that you can annotate the type of a function argument, so I tried it (the annotation is only a hint and is not checked at runtime).
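
One thing to be aware of is that the pattern '+ also removes a single apostrophe, such as the one in d'Arc. MediaWiki emphasis uses runs of two, three, or five apostrophes, so a narrower pattern is possible. The following is a sketch under that assumption; the function name remove_emphasis is made up.

```python
import re

# Two to five apostrophes in a row are treated as emphasis markup;
# a single apostrophe is left alone.
emphasis = re.compile(r"'{2,5}")


def remove_emphasis(value: str) -> str:
    return emphasis.sub("", value)


print(remove_emphasis("'''God''' and ''my'' right"))
print(remove_emphasis("Jeanne d'Arc"))
```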

No.27 Removal of internal links

In addition to the processing in No.26, remove MediaWiki's internal link markup from the template values and convert them to text (reference: markup quick reference table).

Answer

027.py


import pandas as pd
import re

p_quote = re.compile(r"'+")
p_br = re.compile(r"<br/>")
p_ref = re.compile(r"<ref>.*</ref>", re.DOTALL)
p_emphasis1 = re.compile(r"\[\[[^\]]*\|([^|]*?)\]\]")
p_emphasis2 = re.compile(r"\[\[(.+?)\]\]")


def remove_markup(a: list):
    ans = {}
    for i in a:
        i = p_quote.sub("", i)
        i = p_br.sub("", i)
        i = p_emphasis1.sub(r"\1", i)
        if p_emphasis2.search(i):
            i = i.replace("[", "").replace("]", "")
        i = i.split("=")
        ans[i[0]] = i[1]
    return ans


df = pd.read_json("jawiki-country.json.gz", lines=True)
article = df.query("title == 'England'")["text"].values[0]
body = re.findall(r"(?<=Basic information Country\n\|).*(?=\n\}\})", article, flags=re.DOTALL)
# Drop <ref>...</ref> footnotes from each "name=value" field.
fields = [p_ref.sub("", item) for item in body[0].replace(" ", "").split("\n|")]
print(remove_markup(fields))

# -> ...'Motto': '{{lang|fr|Dieuetmondroit}}(French:God and my rights)'...

Comments

Links of the form [[A]] are output as A, and links of the form [[A|...|B]] are output as B. I also precompiled the regular expressions before using them. I am gradually losing track of how regular expressions work ..

Also, looking at the answer above, the syntax highlighting goes wrong when a regular expression is embedded. If you know how to fix this, please let me know.
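
The [[A]] and [[A|...|B]] rule described above can also be written as one substitution. This is a sketch of an alternative, not the code from 027.py, and the function name remove_links is made up.

```python
import re

# [[A]] -> A, [[A|B]] -> B, [[A|x|B]] -> B: keep the text after the last "|".
link = re.compile(r"\[\[(?:[^|\]]*\|)*([^|\]]*)\]\]")


def remove_links(value: str) -> str:
    return link.sub(r"\1", value)


print(remove_links("[[French]]: [[Dieu et mon droit|God and my right]]"))
```

Unlike replacing every [ and ] character, this only touches text that is actually wrapped in double brackets.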

No.28 Removal of MediaWiki markup

In addition to the processing in No.27, remove MediaWiki markup from the template values as much as possible and format the basic country information.

Answer

028.py


import pandas as pd
import re

p_quote = re.compile(r"'+")
p_br = re.compile(r"<br/>")
p_ref = re.compile(r"<ref>.*</ref>", re.DOTALL)
p_emphasis1 = re.compile(r"\[\[[^\]]*\|([^|]*?)\]\]")
p_emphasis2 = re.compile(r"\[\[(.+?)\]\]")
p_templete1 = re.compile(r"\{\{\d\}\}")
p_templete2 = re.compile(r"\{\{.*\|([^|]*?)\}\}")
p_refname = re.compile(r"<refname.*")


def remove_markup(a: list):
    ans = {}
    for i in a:
        i = p_quote.sub("", i)
        i = p_br.sub("", i)
        i = p_emphasis1.sub(r"\1", i)
        if p_emphasis2.search(i):
            i = i.replace("[", "").replace("]", "")
        i = p_templete1.sub("", i)
        i = p_templete2.sub(r"\1", i)
        i = p_refname.sub("", i)
        # Drop the parentheses around "National emblem" (escaped so they match literally).
        i = re.sub(r"\((National emblem)\)", r"\1", i)
        i = re.sub(r"\}\}File.*", "", i).split("=")
        ans[i[0]] = i[1]
    return ans


df = pd.read_json("jawiki-country.json.gz", lines=True)
article = df.query("title == 'England'")["text"].values[0]
body = re.findall(r"(?<=Basic information Country\n\|).*(?=\n\}\})", article, flags=re.DOTALL)
# Drop <ref>...</ref> footnotes from each "name=value" field.
fields = [p_ref.sub("", item) for item in body[0].replace(" ", "").split("\n|")]
print(remove_markup(fields))

# -> ...'Motto': 'Dieuetmondroit (French:God and my rights)'...

Comments

I removed the markup piece by piece, one pattern after another. However, this approach does not seem to generalize well to articles other than the UK one. (For example, in the Singapore article, the flag-width parameter could not be extracted.)

No.29 Get the URL of the national flag image

Use the contents of the template to get the URL of the national flag image. (Hint: Call imageinfo in the MediaWiki API to convert file references to URLs)

Answer

029.py


import requests
import pandas as pd
import re

p_quote = re.compile(r"'+")
p_br = re.compile(r"<br />")
p_ref = re.compile(r"<ref>.*</ref>", re.DOTALL)
p_emphasis1 = re.compile(r"\[\[[^\]]*\|([^|]*?)\]\]")
p_emphasis2 = re.compile(r"\[\[(.+?)\]\]")
p_templete1 = re.compile(r"\{\{\d\}\}")
p_templete2 = re.compile(r"\{\{.*\|([^|]*?)\}\}")
p_refname = re.compile(r"<ref name.*")


def remove_markup(a: list):
    ans = {}
    for i in a:
        i = p_quote.sub("", i)
        i = p_br.sub("", i)
        i = p_emphasis1.sub(r"\1", i)
        if p_emphasis2.search(i):
            i = i.replace("[", "").replace("]", "")
        i = p_templete1.sub("", i)
        i = p_templete2.sub(r"\1", i)
        i = p_refname.sub("", i)
        # Drop the parentheses around "National emblem" (escaped so they match literally).
        i = re.sub(r"\((National emblem)\)", r"\1", i)
        i = re.sub(r"\}\}File.*", "", i).split("=")
        # Trim whitespace around the field name and the value.
        ans[i[0].strip()] = i[1].strip()
    return ans


df = pd.read_json("jawiki-country.json.gz", lines=True)
article = df.query("title == 'England'")["text"].values[0]
body = re.findall(r"(?<=Basic information Country\n\|).*(?=\n\}\})", article, flags=re.DOTALL)
# Drop <ref>...</ref> footnotes from each "name = value" field.
fields = [p_ref.sub("", item) for item in body[0].split("\n|")]
page = remove_markup(fields)

print(page["National flag image"])
url = 'https://en.wikipedia.org/w/api.php'
PARAMS = {
    "action": "query",
    "format": "json",
    "prop": "imageinfo",
    "iiprop": "url",
    "titles": "File:" + page["National flag image"]
}
response = requests.get(url, params=PARAMS)
data = response.json()
for page_info in data["query"]["pages"].values():
    print(page_info["imageinfo"][0]["url"])

# -> https://upload.wikimedia.org/wikipedia/en/a/ae/Flag_of_the_United_Kingdom.svg

Comments

Actually, I wanted to just import the code from No.28, but I changed the markup removal part slightly, so I posted the whole listing.

I send a GET request using the requests module. The imageinfo documentation of the MediaWiki API was very helpful.
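
The loop in 029.py would raise a KeyError if a returned page had no imageinfo entry. Below is a sketch of a more defensive way to read the response, separated from the network call so it can be tried offline. The function names and the fake response are made up for illustration; the fake response only imitates the nesting that 029.py reads (query, pages, imageinfo, url) and is not real API output.

```python
def build_imageinfo_params(file_name):
    """Query parameters asking the MediaWiki API for a file's URL."""
    return {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "iiprop": "url",
        "titles": "File:" + file_name,
    }


def extract_image_urls(data):
    """Collect URLs from a decoded response, skipping pages without imageinfo."""
    urls = []
    for page in data.get("query", {}).get("pages", {}).values():
        for info in page.get("imageinfo", []):
            if "url" in info:
                urls.append(info["url"])
    return urls


# Hand-written stand-in for response.json(); not real API output.
fake_response = {
    "query": {
        "pages": {
            "-1": {"title": "File:Missing.svg"},
            "12345": {
                "title": "File:Flag of the United Kingdom.svg",
                "imageinfo": [{"url": "https://example.org/flag.svg"}],
            },
        }
    }
}
print(build_imageinfo_params("Flag of the United Kingdom.svg")["titles"])
print(extract_image_urls(fake_response))
```

In the real script, the dictionary from build_imageinfo_params would be passed as params to requests.get, and response.json() would be passed to extract_image_urls.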
