Please refer to the first post.
(Added 9/24)
There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format:
- One article per line, stored in JSON format.
- In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out as JSON.
- The entire file is compressed with gzip.
Create a program that performs the following processing.
Read the JSON file of the Wikipedia articles and display the article body for "UK". For problems 21-29, run against the article text extracted here.
json_read_020.py
# -*- coding: utf-8 -*-
import json
import gzip
import re


def uk_find():
    basepath = '/Users/masassy/PycharmProjects/Pywork/training/'
    filename = 'jawiki-country.json.gz'
    pattern = r"England"
    with gzip.open(basepath + filename, 'rt') as gf:
        for line in gf:
            # json.loads is str -> dict, json.load is file -> dict
            json_data = json.loads(line)
            # re.match takes the pattern first, then the string to match
            if re.match(pattern, json_data['title']):
                return json_data['text']


if __name__ == "__main__":
    json_data = uk_find()
    print(json_data)
result
{{redirect|UK}}
{{Basic information Country
|Abbreviated name=England
(Omitted because it is long)
[[Category:Island country|Kureito Furiten]]
[[Category:States / Regions Established in 1801]]
Process finished with exit code 0
Impressions: It took me some time to understand the data format returned by gzip.open and the data format of json_data.
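The two data formats can be checked with a minimal, self-contained sketch. This uses an in-memory gzip stream instead of the real jawiki-country.json.gz (the sample JSON line is made up for illustration), but the types are the same:

```python
import gzip
import io
import json

# Build a tiny gzip-compressed "file" in memory with one JSON object per line,
# mimicking the jawiki-country.json.gz layout.
raw = b'{"title": "England", "text": "{{redirect|UK}} ..."}\n'
buf = io.BytesIO(gzip.compress(raw))

# Opening in 'rt' mode decompresses and decodes, so iteration yields str lines.
with gzip.open(buf, 'rt') as gf:
    for line in gf:
        print(type(line).__name__)     # str
        article = json.loads(line)     # str -> dict
        print(type(article).__name__)  # dict
        print(article['title'])        # England
```

So gzip.open in text mode hands back ordinary strings, and json.loads turns each one into a dictionary.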
Extract the line that declares the category name in the article.
category_021.py
from training.json_read_020 import uk_find
import re

if __name__ == "__main__":
    pattern = re.compile(r'.*Category.*')
    lines = uk_find()
    for line in lines.split('\n'):
        if pattern.match(line):
            print(line)
result
[[Category:England|*]]
[[Category:Commonwealth Kingdom|*]]
[[Category:G8 member countries]]
[[Category:European Union member states]]
[[Category:Maritime nation]]
[[Category:Sovereign country]]
[[Category:Island country|Kureito Furiten]]
[[Category:States / Regions Established in 1801]]
Process finished with exit code 0
Impressions: It took me a while to realize that combining the regular expression .*Category.* with lines.split('\n') lets me return the lines that contain the search string.
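Because pattern.match anchors at the start of the line, the leading .* is what makes it behave like a substring search; re.search, or a plain `in` test, expresses the same idea more directly. A small sketch on made-up sample lines:

```python
import re

lines = "[[Category:England|*]]\nsome other line\n[[Category:Island country]]"

# All three approaches select exactly the lines containing 'Category'.
by_match = [l for l in lines.split('\n') if re.match(r'.*Category.*', l)]
by_search = [l for l in lines.split('\n') if re.search(r'Category', l)]
by_in = [l for l in lines.split('\n') if 'Category' in l]

assert by_match == by_search == by_in
print(by_match)  # ['[[Category:England|*]]', '[[Category:Island country]]']
```

The `in` version is the simplest when no real pattern matching is needed.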
Extract the article category names (by name, not line by line).
category_str_022.py
from training.json_read_020 import uk_find
import re

if __name__ == "__main__":
    pattern = re.compile(r'.*Category:.*')
    pattern2 = re.compile(r'.*\|.*')
    lines = uk_find()
    for line in lines.split('\n'):
        if pattern.match(line):
            # lstrip()/rstrip() strip a *set* of characters, not an exact
            # prefix, so slice off the literal '[[Category:' prefix instead
            strip_line = line[len('[[Category:'):].rstrip(']')
            if pattern2.match(strip_line):
                N = strip_line.find('|')
                strip_line2 = strip_line[:N]
                print(strip_line2)
            else:
                print(strip_line)
result
England
Commonwealth Kingdom
G8 member countries
European Union member states
Maritime nation
Sovereign country
Island country
States / Regions Established in 1801
Process finished with exit code 0
Impression: After extracting the Category lines, the point of ingenuity is that if a line contains a '|', the end of the slice is taken from the position of that '|'.
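The find-then-slice step can also be written with str.split, which returns the text before the first '|' whether or not one is present, so the branch on pattern2 disappears. A sketch using two category strings from the results above:

```python
# split('|', 1)[0] yields the whole string when there is no '|',
# so no separate if/else branch is needed.
for strip_line in ("England|*", "Sovereign country"):
    print(strip_line.split('|', 1)[0])
# England
# Sovereign country
```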
Display the section name and its level contained in the article (for example, 1 if "== section name ==").
section_023.py
import re
from training.json_read_020 import uk_find

if __name__ == "__main__":
    pattern = re.compile(r'^=.*')
    pattern2 = re.compile(r'^={2}')
    pattern3 = re.compile(r'^={3}')
    pattern4 = re.compile(r'^={4}')
    lines = uk_find()
    for line in lines.split('\n'):
        if pattern.match(line):
            # test the longest prefix first so '====' is not caught by '=='
            if pattern4.match(line):
                print(line.strip('=') + ':Level 4')
            elif pattern3.match(line):
                print(line.strip('=') + ':Level 3')
            elif pattern2.match(line):
                print(line.strip('=') + ':Level 2')
            else:
                print('no match')
result
Country name:Level 2
history:Level 2
Geography:Level 2
climate:Level 3
(Omitted because it is long)
Process finished with exit code 0
Impressions: After compiling four patterns, I first extracted the lines starting with =, then branched the processing by level, but it felt quite brute-force. There is probably a better way.
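One alternative to the four compiled patterns is a single regex that captures the run of '=' characters and the section name; the level then falls out of the length of the capture. A sketch on made-up sample lines, still reporting the raw count of '=' as the level, matching the output above:

```python
import re

# One pattern: a run of 2-4 '='s, the section name, then trailing '='s.
section = re.compile(r'^(={2,4})\s*(.+?)\s*=+$')

sample = "== Country name ==\n=== climate ===\nplain text\n==== detail ===="
for line in sample.split('\n'):
    m = section.match(line)
    if m:
        print('{}:Level {}'.format(m.group(2), len(m.group(1))))
# Country name:Level 2
# climate:Level 3
# detail:Level 4
```

The non-greedy (.+?) together with \s* also trims the spaces around the section name, which the strip('=') approach leaves in place.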
Extract all the media files referenced in the article.
media_024.py
from training.json_read_020 import uk_find
import re

if __name__ == "__main__":
    # the article marks media both as 'File:' and as 'ファイル:'
    pattern = re.compile(r".*(File|ファイル).*")
    # the '|' must be escaped; unescaped, r"^|.*" means '^' OR '.*'
    pattern2 = re.compile(r"^\|.*")
    lines = uk_find()
    for line in lines.split('\n'):
        if pattern2.search(line):
            line = line.lstrip('|')
        if pattern.search(line):
            start = line.find(':') + 1
            end = line.find('|')
            if end == -1:
                end = len(line)  # no trailing '|': take the rest of the line
            print(line[start:end])
result
Royal Coat of Arms of the United Kingdom.svg
Battle of Waterloo 1815.PNG
The British Empire.png
Uk topo en.jpg
BenNevis2005.jpg
(Omitted because it is long)
Impressions: The point of ingenuity was consulting the markup quick reference to determine the slice positions, and processing each line to match the format of the file reference.
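The media references can also be pulled out in one pass with re.findall, capturing the filename between 'File:' (or 'ファイル:') and the next '|' or ']'. A sketch on markup fragments shaped like the article's (the fragment text itself is illustrative):

```python
import re

text = (
    "|国章画像 = [[File:Royal Coat of Arms of the United Kingdom.svg|85px]]\n"
    "[[ファイル:Uk topo en.jpg|thumb|200px|topography]]\n"
)

# Capture everything after 'File:'/'ファイル:' up to the first '|' or ']'.
files = re.findall(r'(?:File|ファイル):([^|\]]+)', text)
print(files)
# ['Royal Coat of Arms of the United Kingdom.svg', 'Uk topo en.jpg']
```

This removes the manual find(':')/find('|') slicing and works for both infobox lines and inline [[File:...]] links.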