About the history so far

Please refer to the first post.

Knock status

Added on 9/24.

Chapter 3: Regular Expressions

There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format.

- Each line stores the information of one article in JSON format.
- In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format.
- The entire file is compressed with gzip.

Create a program that performs the following processing.
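The `uk_find` helper imported in the scripts below comes from an earlier post and is not shown here. As a rough sketch of what reading this file format involves, assuming only the layout described above (the function name `find_article` and its arguments are hypothetical, not the author's code):

```python
import gzip
import json

def find_article(path, title):
    """Return the "text" of the article whose "title" matches, or None."""
    # "rt" decompresses and decodes the gzip stream as text, line by line.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            article = json.loads(line)  # one JSON object per line
            if article["title"] == title:
                return article["text"]
    return None
```

Reading line by line keeps memory use low, because only one article is decoded at a time.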

025. Template extraction

Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.

`basic_info_025.py`


from training.json_read_020 import uk_find
import re

def basic_info_find(lines):
    # Opening line of the template, e.g. "{{Basic information ...".
    # A group is needed here; [...] would be a character class.
    pattern1 = re.compile(r'^\{\{(?:redirect|Basic information).*')
    # Field lines look like "|name = value".
    pattern2 = re.compile(r'^\|.*')
    # A line holding only "}}" closes the template.
    pattern3 = re.compile(r'^\}\}$')

    basic_dict = {}
    for line in lines.split('\n'):
        if pattern1.match(line):
            continue

        elif pattern2.match(line):
            # Split on the first '=' only; skip lines that have none.
            title, sep, data = line.partition('=')
            if not sep:
                continue
            basic_dict[title.lstrip('|').strip()] = data.strip()

        elif pattern3.match(line):
            break
    return basic_dict

if __name__=="__main__":
    lines = uk_find()
    basic_dict = basic_info_find(lines)
    for key,value in basic_dict.items():
        print(key+':'+value)

`result`


Established form 4:Changed to the current country name "'''United Kingdom of Great Britain and Northern Ireland'''"
National emblem image:[[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]
National emblem link:（[[British coat of arms|National emblem]]）
(Omitted because it is long)
Process finished with exit code 0

Impressions: From the basic information template I picked out the lines that start with |, then looped over them and stored the text before the = as the key and the text after it as the value of a dictionary. The printed result is formatted as key:value so that it is easy to read.
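The same idea can also be written as a single multi-line regular expression. The following is only a sketch: `SAMPLE` is a made-up snippet shaped like the template, not the real article text, and `extract_fields` is a hypothetical name. It assumes the text passed in contains nothing but the template, since it does not stop at the closing }}.

```python
import re

# Hypothetical sample shaped like the template; not taken from the real article.
SAMPLE = """{{Basic information country
|Abbreviated name = United Kingdom
|ccTLD = .uk / .gb
}}
Body text follows here."""

# One field per line: "|name = value". [ \t]* is used instead of \s*
# so that the match can never run over a line break.
FIELD = re.compile(r"^\|([^=\n]+?)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)

def extract_fields(text):
    """Return a dict of field name -> value for every "|name = value" line."""
    return dict(FIELD.findall(text))

print(extract_fields(SAMPLE))
# {'Abbreviated name': 'United Kingdom', 'ccTLD': '.uk / .gb'}
```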

026. Removal of highlighted markup

In addition to the processing of 25, remove MediaWiki's emphasis markup (weak emphasis, emphasis, and strong emphasis) from the template values and convert them to text (reference: markup quick reference table).

`emphasize_remove_026.py`


from training.json_read_020 import uk_find
from training.basic_info_025 import basic_info_find
import re

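def emphasize_remove(basic_dict):
    # MediaWiki emphasis is written with 2 (weak), 3 (emphasis)
    # or 5 (strong) consecutive apostrophes.
    pattern = re.compile(r"'{2,5}")
    for key,value in basic_dict.items():
        # Remove only runs of apostrophes, so a single ' in the text survives.
        basic_dict[key] = pattern.sub('', value)
    return basic_dict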


if __name__ == "__main__":
    lines = uk_find()
    basic_dict = basic_info_find(lines)
    emphasize_remove_dict = emphasize_remove(basic_dict)
    for key,value in emphasize_remove_dict.items():
        print(key+':'+value)

`result`


GDP statistics year yuan:2012
Established form 4:Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"
Area size:1 E11
(Omitted because it is long)
Process finished with exit code 0

Impressions: Only one value actually contained emphasis markup, but the pattern puts a repetition count on ' so that every kind of emphasis markup is found. Once it was found, I just removed it.
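To check which apostrophes such a pattern removes, here is a small sketch on made-up strings. It assumes the usual MediaWiki convention of 2, 3, or 5 apostrophes for the three emphasis levels; `strip_emphasis` is a hypothetical helper name.

```python
import re

# Runs of 2 to 5 apostrophes cover weak (''), normal (''') and strong (''''') emphasis.
EMPHASIS = re.compile(r"'{2,5}")

def strip_emphasis(value):
    """Remove emphasis markup but keep single apostrophes."""
    return EMPHASIS.sub("", value)

print(strip_emphasis("''weak'' '''emphasis''' '''''strong'''''"))
# weak emphasis strong
print(strip_emphasis("the King's '''motto'''"))
# the King's motto
```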

027. Removal of internal links

In addition to the processing of 26, remove MediaWiki's internal link markup from the template values and convert them to text (reference: markup quick reference table).

`link_remove_027.py`


from training.json_read_020 import uk_find
from training.basic_info_025 import basic_info_find
from training.emphasize_remove_026 import emphasize_remove
import re

def link_remove(emphasize_remove_dict):
    # Matches any value that contains "[[".
    pattern = re.compile(r".*\[{2}.*")
    for key,value in emphasize_remove_dict.items():
        if pattern.match(value):
            # Only the brackets are dropped; "target|label" text stays as it is.
            value = value.replace('[[','').replace(']]','')
            emphasize_remove_dict[key] = value
    return emphasize_remove_dict

if __name__=="__main__":
    lines = uk_find()
    basic_dict = basic_info_find(lines)
    emphasize_remove_dict=emphasize_remove(basic_dict)
    link_remove_dict = link_remove(emphasize_remove_dict)

    for key,value in link_remove_dict.items():
        print(key+':'+value)

`result`


National emblem image:File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms
Official country name:{{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br/>
Founding form:Founding of the country
(Omitted because it is long)
Process finished with exit code 0

Impressions: As in problem 026, when I found an internal link starting with [[, I just removed the [[ and ]] with replace.
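Removing only the brackets leaves piped links such as `[[target|label]]` as `target|label`. One possible alternative, shown here only as a sketch and not as the approach used in this post, is to keep just the display text with `re.sub`. The name `strip_internal_links` is hypothetical, and file links with several pipes are deliberately left untouched.

```python
import re

# Matches [[target]] and [[target|label]]; the last segment is captured.
# Links with more than one pipe (e.g. [[File:x.svg|85px|caption]]) do not match.
INTERNAL_LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]+)\]\]")

def strip_internal_links(value):
    """Replace each internal link with its display text."""
    return INTERNAL_LINK.sub(r"\1", value)

print(strip_internal_links("（[[British coat of arms|National emblem]]）"))
# （National emblem）
print(strip_internal_links("[[London]]"))
# London
```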

028. MediaWiki markup removal

In addition to the processing of 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.

`markup_remove_028.py`


from training.json_read_020 import uk_find
from training.basic_info_025 import basic_info_find
from training.emphasize_remove_026 import emphasize_remove
from training.link_remove_027 import link_remove
import re

# Removes the pound entity "(&pound;)".
def pound_check(value):
    pattern = re.compile(r".*pound.*")
    if pattern.match(value):
        value = value.replace("(&pound;)",'')
        return value
    else:
        return  value

# Removes <br /> and <br/> tags.
def br_check(value):
    pattern1 = re.compile(r".*<br.*")
    if pattern1.match(value):
        value = value.replace("<br />", '').replace("<br/>", '')
        return value
    else:
        return value

# Removes the ref tag and the reference text that follows it.
def ref_check(value):
    pattern2 = re.compile(r".*<ref.*")
    if pattern2.match(value):
        start_point = value.find("<ref")
        value = value[0:start_point]
        return value
    else:
        return value

# Removes {{ and }} and keeps the text of a {{lang|xx|...}} template.
def brackets_check(value):
    pattern3 = re.compile(r".*\{\{.*")
    if pattern3.match(value):
        value = value.replace("{{","").replace("}}","")
        # For "lang|en|United ~", skip "|en|" by starting 4 characters
        # after the first pipe. Values without a pipe are left as they are.
        pipe_point = value.find("|")
        if pipe_point != -1:
            value = value[pipe_point + 4:]
        return value
    else:
        return value

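# Removes the "File:" prefix and everything from the first pipe onward.
def file_check(value):
    pattern4 = re.compile(r".*File.*")
    if pattern4.match(value):
        value = value.replace('File:','')
        # Cut at the first pipe (size, caption, ...) only if there is one;
        # find() returns -1 otherwise, which would drop the last character.
        end_point = value.find("|")
        if end_point != -1:
            value = value[0:end_point]
        return value
    else:
        return value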

# Removes half-width pipes by cutting the value at the first |.
# When the value also contains (, a closing ) is appended after the cut.
def pipe_check(value):
     pattern5 = re.compile(r".*\|.*")
     pattern6 = re.compile(r".*\(.*")
     if pattern5.match(value) and pattern6.match(value) :
         end_point = value.find("|")
         value = value[0:end_point] + ")"
         return value
     elif pattern5.match(value):
         end_point = value.find("|")
         value = value[0:end_point]
         return value
     else:
         return value

# Removes the full-width opening parenthesis （ when the value starts with it.
def other_check(value):
    pattern7 = re.compile(r"^\（")
    if pattern7.match(value):
        value = value.replace("（","")
        return value
    else:
        return value

def markup_remove(link_remove_dict):
    for key,value in link_remove_dict.items():
        value = pound_check(value)
        value = br_check(value)
        value = ref_check(value)
        value = brackets_check(value)
        value = file_check(value)
        value = pipe_check(value)
        value = other_check(value)
        link_remove_dict.update({key:value})

    return link_remove_dict


if __name__=="__main__":
    lines = uk_find()
    basic_dict = basic_info_find(lines)
    emphasize_remove_dict=emphasize_remove(basic_dict)
    link_remove_dict = link_remove(emphasize_remove_dict)
    markup_remove_dict = markup_remove(link_remove_dict)

    for key,value in markup_remove_dict.items():
        print(key+':'+value)

    print(len(markup_remove_dict.items()))

`result`


Date of establishment 1:927/843
Official country name:United Kingdom of Great Britain and Northern Ireland
Established form 1:Kingdom of England / Kingdom of Scotland (Both countries are Acts of Union)(1707))
Position image:Location_UK_EU_Europe_001.svg
Motto:Dieu et mon droit (French:God and my rights)
ccTLD:.uk / .gb
National flag image:Flag of the United Kingdom.svg
currency:Sterling pound
(Omitted because it is long)
Process finished with exit code 0

Impressions: First I looked for the markup, wrote a compiled pattern, and checked again and again what kind of markup it caught. In the end I decided to run every pattern against each value. The full-width characters, however, did not match, and I really could not work out why they were not caught. I'm tired.
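The find-and-slice checks above can also be expressed as a short chain of `re.sub` calls. This is a minimal sketch under the assumption that only ref tags, br tags, and {{lang|xx|...}} templates need handling; it is not a complete MediaWiki cleaner, and `strip_markup` is a hypothetical name.

```python
import re

def strip_markup(value):
    """Remove ref tags, br tags and unwrap {{lang|xx|text}} templates."""
    # Self-closing references such as <ref name="a" />.
    value = re.sub(r"<ref[^>]*?/>", "", value)
    # Paired <ref ...>...</ref> including the reference text.
    value = re.sub(r"<ref[^>]*>.*?</ref>", "", value, flags=re.DOTALL)
    # An unclosed <ref (the value was cut at a line break): drop the rest.
    value = re.sub(r"<ref.*", "", value, flags=re.DOTALL)
    # <br>, <br/> and <br />.
    value = re.sub(r"<br\s*/?>", "", value)
    # {{lang|en|text}} -> text
    value = re.sub(r"\{\{lang\|[^|{}]*\|([^{}]*)\}\}", r"\1", value)
    return value.strip()

raw = ("{{lang|en|United Kingdom of Great Britain and Northern Ireland}}"
       "<ref>Official country name other than English:<br/>")
print(strip_markup(raw))
# United Kingdom of Great Britain and Northern Ireland
```

Because each substitution is a no-op when its pattern is absent, there is no need to test for a match before replacing.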

029. Get the URL of the national flag image

Use the contents of the template to get the URL of the national flag image. (Hint: Call imageinfo in the MediaWiki API to convert file references to URLs)

`get_url_029.py`


# -*- coding: utf-8 -*-

from training.json_read_020 import uk_find
from training.basic_info_025 import basic_info_find
import requests
import urllib.parse
import json
import re

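def image_query(filename):
    # Build the imageinfo query URL for the MediaWiki API on Wikimedia Commons.
    url = "https://commons.wikimedia.org/w/api.php?"
    action = "action=query&"
    titles = "titles=File:"+urllib.parse.quote(filename)+"&"
    prop = "prop=imageinfo&"
    iiprop="iiprop=url&"
    fmt = "format=json"  # renamed so the built-in format() is not shadowed
    parameter = url +action+titles+prop+iiprop+fmt
    return parameter

def get_request(parameter):
    pattern = re.compile(r".*\"url\".*")
    r = requests.get(parameter)
    data = r.json()
    # The page ID key ("347935" for this file) differs per file,
    # so take the first page instead of hardcoding the ID.
    page = next(iter(data["query"]["pages"].values()))
    json_data = json.dumps(page["imageinfo"], indent=4)
    url_data = None
    for temp in json_data.split('\n'):
        # Keep the line that holds the "url" key, with spaces removed.
        if pattern.search(temp):
            url_data = temp.replace(" ","")

    return url_data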

if __name__=="__main__":
    lines = uk_find()
    basic_dict = basic_info_find(lines)
    parameter=image_query(basic_dict['National flag image'])
    get_url = get_request(parameter)
    print(get_url)

`result`


"url":"https://upload.wikimedia.org/wikipedia/commons/a/ae/Flag_of_the_United_Kingdom.svg"

Process finished with exit code 0

Impressions: At first I had no idea what to do. After a lot of googling, I understood that the point is to send a request to Wikimedia that looks up the data for the file name, and then to pick out of the response the URL where the image file is uploaded. It took me a long time to understand this problem, and I learned a lot from it.
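Since the response is already parsed JSON, the URL can also be read directly from the nested dictionaries. The sketch below works offline on `SAMPLE_RESPONSE`, a hand-written dictionary that only imitates the shape of an imageinfo reply (its page ID and URL are taken from this post, the rest is illustrative); `first_image_url` is a hypothetical helper name.

```python
# Hand-written sample imitating the shape of an imageinfo response.
SAMPLE_RESPONSE = {
    "query": {
        "pages": {
            "347935": {
                "title": "File:Flag of the United Kingdom.svg",
                "imageinfo": [
                    {"url": "https://upload.wikimedia.org/wikipedia/commons/a/ae/Flag_of_the_United_Kingdom.svg"}
                ],
            }
        }
    }
}

def first_image_url(response):
    """Return the first imageinfo URL found in the response, or None."""
    pages = response.get("query", {}).get("pages", {})
    # Iterate over the values so the page ID does not need to be known.
    for page in pages.values():
        for info in page.get("imageinfo", []):
            if "url" in info:
                return info["url"]
    return None

print(first_image_url(SAMPLE_RESPONSE))
# https://upload.wikimedia.org/wikipedia/commons/a/ae/Flag_of_the_United_Kingdom.svg
```

With a live request, the dictionary returned by `r.json()` could be passed to this function in place of the sample.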

[Python] Challenge 100 knocks! (025-029)
