100 language processing knock-29: Get the URL of the national flag image

This is a record of knock 29, "Get the URL of the national flag image," from "Chapter 3: Regular Expressions" (.ac.jp/nlp100/#ch3) of 100 Language Processing Knocks 2015. This time, we extract the value of a specific field from the dictionary built with regular expressions and send it to a web service.

Reference link

| Link | Remarks |
|---|---|
| 029.Get the URL of the national flag image.ipynb | Answer program (GitHub link) |
| 100 amateur language processing knocks:29 | Source from which much of the code was copied |
| Python regular expression basics and tips to learn from scratch | My notes on what I learned in this knock |
| Regular expression HOWTO | Official Python regular expression HOWTO |
| re --- Regular expression operations | Official Python documentation for the re module |
| Help:Simplified chart | Quick reference for common Wikipedia markup |

Environment

| Type | Version | Notes |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | I'm using Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv. |

In the above environment, I use the following additional Python package, installed with plain pip. I considered using the requests package, but the task was simple enough that I didn't need it. Using it would probably make the code a little shorter.

| Type | Version |
|---|---|
| pandas | 0.25.3 |

Chapter 3: Regular Expressions

Content of study

By applying regular expressions to the markup description on Wikipedia pages, various information and knowledge can be extracted.

Regular Expressions, JSON, Wikipedia, InfoBox, Web Services

Knock content

There is a file, jawiki-country.json.gz, that exports Wikipedia articles in the following format.

- One article is stored per line in JSON format
- In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format
- The entire file is gzipped
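The file format described above can be read without decompressing the file first. The following is a minimal sketch using only the standard library; the function name `find_article_text` is hypothetical, and the answer program further down takes a different route (it reads an already decompressed file with pandas).

```python
import gzip
import json


def find_article_text(path, title):
    """Return the 'text' of the first article whose 'title' matches, else None."""
    # Open the gzipped file in text mode; each non-empty line is one JSON object.
    with gzip.open(path, mode='rt', encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            article = json.loads(line)
            if article['title'] == title:
                return article['text']
    return None
```

Reading line by line keeps memory use low, because only one article is held at a time.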

Create a program that performs the following processing.

29. Get the URL of the national flag image

Use the contents of the template to get the URL of the national flag image. (Hint: call imageinfo in the MediaWiki API to convert the file reference to a URL.)

Problem supplement (About "MediaWiki API")

I referred to the following two links regarding the MediaWiki API.

- MediaWiki API: main page of the API
- imageinfo: samples and explanation of the parameters
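As a rough illustration of what an imageinfo query looks like, the sketch below builds the request URL with `urllib.parse.urlencode`. The parameters are the ones used in the answer program; the helper name `build_imageinfo_url` and the sample file name are assumptions made for this example only.

```python
from urllib import parse

# Endpoint used in the answer program.
API_ENDPOINT = 'https://www.mediawiki.org/w/api.php'


def build_imageinfo_url(file_name, endpoint=API_ENDPOINT):
    """Build an imageinfo query URL for the given file name (hypothetical helper)."""
    params = {
        'action': 'query',            # run a query
        'titles': 'File:' + file_name,
        'format': 'json',             # ask for a JSON response
        'prop': 'imageinfo',          # request image information
        'iiprop': 'url',              # limit image information to the URL
    }
    # urlencode percent-encodes every value, so spaces in the file name are safe.
    return endpoint + '?' + parse.urlencode(params)


# Sample file name assumed for illustration.
print(build_imageinfo_url('Flag of the United Kingdom.svg'))
```

Note that `urlencode` encodes a space as `+` rather than the `%20` produced by `parse.quote`; both forms are decoded to a space in a query string.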

Answer

Answer program [029. Get the URL of the national flag image.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/03.%E6%AD%A3%E8%A6%8F%E8%A1%A8%E7%8F%BE/029.%E5%9B%BD%E6%97%97%E7%94%BB%E5%83%8F%E3%81%AEURL%E3%82%92%E5%8F%96%E5%BE%97%E3%81%99%E3%82%8B.ipynb)

from collections import OrderedDict
import json
import re
from urllib import request, parse

import pandas as pd

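def extract_by_title(title):
    # jawiki-country.json is the decompressed file, one JSON article per line
    df_wiki = pd.read_json('jawiki-country.json', lines=True)
    return df_wiki[(df_wiki['title'] == title)]['text'].values[0]

# The dump comes from Japanese Wikipedia, so the title is given in Japanese
# ('イギリス' is the article on the United Kingdom)
wiki_body = extract_by_title('イギリス')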

basic = re.search(r'''
                    ^\{\{基礎情報.*?\n  # Literal '{{基礎情報' (braces escaped), then the rest of the line, non-greedy
                    (.*?)              # Capture: template body, any characters, non-greedy
                    \}\}               # Literal '}}' (braces escaped)
                    $                  # End of line
                    ''', wiki_body, re.MULTILINE | re.VERBOSE | re.DOTALL)

templates = OrderedDict(re.findall(r'''
                          ^\|         # Literal '|' at the start of a line (escaped), not captured
                          (.+?)       # Capture (key), non-greedy
                          \s*         # Zero or more whitespace characters
                          =           # Literal '=', not captured
                          \s*         # Zero or more whitespace characters
                          (.+?)       # Capture (value), non-greedy
                          (?:         # Start of non-capturing group
                            (?=\n\|)  # Followed by a newline and '|' (positive lookahead)
                          | (?=\n$)   # or by a newline and the end of a line (positive lookahead)
                          )           # End of non-capturing group
                         ''', basic.group(1), re.MULTILINE | re.VERBOSE | re.DOTALL))

# Markup removal
def remove_markup(string):

    # Remove emphasis markup
    # Targets: ''Distinguish from others'', '''Emphasis''', '''''Italics and emphasis'''''
    replaced = re.sub(r'''
                       (\'{2,5})   # 2 to 5 single quotes (start of markup)
                       (.*?)       # Zero or more characters, non-greedy (the text to keep)
                       (\1)        # Same as the first capture (end of markup)
                       ''', r'\2', string, flags=re.MULTILINE | re.VERBOSE)

    # Remove internal links and files
    # Targets: [[Article title]], [[Article title|Display text]], [[Article title#Section name|Display text]], [[File:Wi.png|thumb|Explanatory text]]
    replaced = re.sub(r'''
        \[\[             # '[[' (start of markup)
        (?:              # Start of non-capturing group
            [^|]*?       # Zero or more characters other than '|', non-greedy
            \|           # '|'
        )*?              # End of group; it repeats zero or more times, non-greedy (changed from knock 27)
        (                # Start of capturing group
          (?!Category:)  # Negative lookahead (no match if the displayed part starts with 'Category:')
          ([^|]*?)       # Zero or more characters other than '|', non-greedy (the text to display)
        )
        \]\]             # ']]' (end of markup)
        ''', r'\1', replaced, flags=re.MULTILINE | re.VERBOSE)

    # Remove Template:Lang
    # Target: {{lang|Language tag|String}}
    replaced = re.sub(r'''
        \{\{lang    # '{{lang' (start of markup)
        (?:         # Start of non-capturing group
            [^|]*?  # Zero or more characters other than '|', non-greedy
            \|      # '|'
        )*?         # End of group; it repeats zero or more times, non-greedy
        ([^|]*?)    # Capture: zero or more characters other than '|', non-greedy (the text to display)
        \}\}        # '}}' (end of markup)
        ''', r'\1', replaced, flags=re.MULTILINE | re.VERBOSE)
    
    # Remove external links
    # Targets: [http(s)://xxxx], [http(s)://xxx xxx]
    replaced = re.sub(r'''
        \[https?:// # '[http://' or '[https://' (start of markup)
        (?:         # Start of non-capturing group
            [^\s]*? # Zero or more non-whitespace characters, non-greedy
            \s      # Whitespace
        )?          # End of group; it appears zero or one time
        ([^]]*?)    # Capture: zero or more characters other than ']', non-greedy (the text to display)
        \]          # ']' (end of markup)
        ''', r'\1', replaced, flags=re.MULTILINE | re.VERBOSE)

    # Remove HTML tags
    # Targets: <xx> </xx> <xx/>
    replaced = re.sub(r'''
        <           # '<' (start of markup)
        .+?         # One or more characters, non-greedy
        >           # '>' (end of markup)
        ''', '', replaced, flags=re.MULTILINE | re.VERBOSE)

    return replaced

# Strip markup from every field value
for key, value in templates.items():
    templates[key] = remove_markup(value)

# Build the request URL
# ('国旗画像' is the "national flag image" field of the template)
url = 'https://www.mediawiki.org/w/api.php?' \
    + 'action=query' \
    + '&titles=File:' + parse.quote(templates['国旗画像']) \
    + '&format=json' \
    + '&prop=imageinfo' \
    + '&iiprop=url'

# Send the request to the MediaWiki service
connection = request.urlopen(request.Request(url))

# Decode the response body as JSON
response = json.loads(connection.read().decode())

print(response['query']['pages']['-1']['imageinfo'][0]['url'])
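The field-extraction step of the program above can be tried in isolation on a small sample. The sketch below uses the same two-stage idea (find the template body, then split it into key/value pairs) in a compact, non-verbose form. The sample text, the template name `Infobox`, and the helper name `parse_template_fields` are all made up for illustration.

```python
import re
from collections import OrderedDict

# Made-up template text for illustration; it is not taken from the real dump.
SAMPLE = '''{{Infobox
|name = Exampleland
|flag = Flag of Exampleland.svg
|motto = First line
second line
}}'''


def parse_template_fields(text, template_name):
    """Return the '|key = value' fields of the first matching template."""
    # Stage 1: capture everything between the template's first line and '}}'.
    body = re.search(
        r'^\{\{' + re.escape(template_name) + r'.*?\n(.*?)\}\}$',
        text, re.MULTILINE | re.DOTALL)
    if body is None:
        return OrderedDict()
    # Stage 2: a value ends where the next line starts with '|' or the body ends,
    # so values that span several lines stay in one piece.
    pairs = re.findall(
        r'^\|(.+?)\s*=\s*(.+?)(?:(?=\n\|)|(?=\n$))',
        body.group(1), re.MULTILINE | re.DOTALL)
    return OrderedDict(pairs)


fields = parse_template_fields(SAMPLE, 'Infobox')
for key, value in fields.items():
    print(repr(key), '->', repr(value))
```

The lookahead at the end of the second pattern is what allows a value to contain line breaks, which a simple line-by-line split would cut off.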

Answer commentary

The main part this time is the code below. The URL parameter takes the value of the national flag image field from the dictionary built with the regular expressions. Because the raw value contains spaces and other characters, it is encoded with the [`urllib.parse.quote` function](https://docs.python.org/ja/3/library/urllib.parse.html#urllib.parse.quote). No authentication is required, which makes this easy.

# Build the request URL
# ('国旗画像' is the "national flag image" field of the template)
url = 'https://www.mediawiki.org/w/api.php?' \
    + 'action=query' \
    + '&titles=File:' + parse.quote(templates['国旗画像']) \
    + '&format=json' \
    + '&prop=imageinfo' \
    + '&iiprop=url'

# Send the request to the MediaWiki service
connection = request.urlopen(request.Request(url))

# Decode the response body as JSON
response = json.loads(connection.read().decode())

print(response['query']['pages']['-1']['imageinfo'][0]['url'])
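The lookup above hard-codes the page key `'-1'`. As a slightly more defensive variant, the sketch below walks whatever keys appear under `pages` and returns the first URL it finds. It assumes only the response shape implied by that lookup; the helper name `first_image_url` and the hand-written sample response are hypothetical.

```python
def first_image_url(response):
    """Return the first imageinfo URL in a decoded API response, or None."""
    pages = response.get('query', {}).get('pages', {})
    # Iterate over the page entries instead of relying on a fixed key.
    for page in pages.values():
        for info in page.get('imageinfo', []):
            if 'url' in info:
                return info['url']
    return None


# Hand-written stand-in for the decoded JSON, shaped like the lookup above.
sample_response = {
    'query': {
        'pages': {
            '-1': {
                'imageinfo': [
                    {'url': 'https://upload.wikimedia.org/wikipedia/commons/a/ae/Flag_of_the_United_Kingdom.svg'}
                ]
            }
        }
    }
}

print(first_image_url(sample_response))
```

Returning `None` instead of raising `KeyError` lets the caller decide how to handle a file that the service could not resolve.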

Output result (execution result)

When the program is executed, the following results will be output.


https://upload.wikimedia.org/wikipedia/commons/a/ae/Flag_of_the_United_Kingdom.svg

This is the image at that URL.
