100 Natural Language Processing Knocks, Chapter 3: Regular Expressions (first half)

This is a record of solving the problems in the first half of Chapter 3. As you can see on the web page, the target file is jawiki-country.json, which is the decompressed form of jawiki-country.json.gz.

There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format. Information for one article is stored per line in JSON format. In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format. The entire file is gzip-compressed. Create programs that perform the following processing.
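As a side note on the format described above, the compressed file can also be read directly, without decompressing it first. The following is only a minimal sketch, assuming Python 3; the helper name iter_articles is made up for illustration and does not appear in the original scripts.

```python
import gzip
import json


def iter_articles(path):
    """Yield one dict per non-empty line of a gzipped JSON Lines file."""
    # 'rt' opens the gzip stream in text mode, so each line is a str.
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

It could be used as `for article in iter_articles('jawiki-country.json.gz'): print(article['title'])`.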

20. Reading JSON data

Read the JSON file of Wikipedia articles and display the text of the article about the "UK". Problems 21-29 operate on the article text extracted here.

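# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import io
import json
import re

inputfile = 'jawiki-country.json'
outputfile = 'jawiki-england.txt'

# Title of the "UK" article; this assumes the title is exactly イギリス.
target = re.compile(u'^イギリス$')

# io.open with an explicit encoding works on both Python 2 and Python 3.
with io.open(inputfile, encoding='utf-8') as f, \
        io.open(outputfile, 'w', encoding='utf-8') as g:
    for line in f:
        res = json.loads(line)  # one article per line
        # Test the title, not the body: many other articles mention the UK.
        if target.search(res[u'title']):
            g.write(res[u'text'] + u'\n')

#=> (output is written to the file jawiki-england.txt)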

This uses the re module. Because the Japanese text is handled as a Unicode string, the search term is written as a Unicode literal of the form u'...'. The pattern is compiled into a regular expression object with the compile method of the re module, and the search method then decides, for each line (one article), whether it matches the target (UK).
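To make the compile and search steps concrete, here is a small sketch on a made-up sentence. It is written for Python 3, where every str is already Unicode and the u prefix is optional. It also shows how search differs from match, which only succeeds at the start of the string.

```python
import re

target = re.compile('イギリス')
sentence = '正式名称はイギリスである'  # made-up sample text

# search() finds the pattern anywhere in the string.
found = target.search(sentence)
# match() only succeeds if the pattern matches at the very start.
at_start = target.match(sentence)

print(found is not None)  # True
print(at_start is None)   # True
```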

21. Extract rows containing category names

Extract the lines that declare category names in the article.

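# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import io
import re

inputfile = 'jawiki-england.txt'
outputfile = 'jawiki-england_category.txt'

# Raw string, so the backslashes reach the regex engine unchanged.
category = re.compile(r'\[\[Category:.+\]\]')

with io.open(inputfile, encoding='utf-8') as f, \
        io.open(outputfile, 'w', encoding='utf-8') as g:
    for line in f:
        # match() anchors at the start of the line, where declarations begin.
        if category.match(line):
            g.write(line.strip() + u'\n')

#=> (output is written to the file jawiki-england_category.txt)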

The approach is the same as in the previous problem: check whether each line contains [[Category:~]]. The [ and ] must be escaped because they are special characters in a regular expression.
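The escaping can be tried out on a few sample lines (the strings here are made up for illustration). The sketch also uses re.escape, a standard library function that escapes a literal string so it does not have to be done by hand. It assumes Python 3.

```python
import re

# Escaped brackets are matched literally instead of opening a character class.
category = re.compile(r'\[\[Category:.+\]\]')

lines = ['[[Category:イギリス|*]]', '[[Category:島国]]', '== 歴史 ==']
hits = [s for s in lines if category.match(s)]
print(hits)  # the two Category lines

# re.escape() returns the literal text with special characters escaped.
prefix = re.escape('[[Category:')
same = re.compile(prefix + r'.+\]\]')
```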

22. Extraction of category name

Extract the category names of the article (the names themselves, not whole lines).

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import io
import re

inputfile = 'jawiki-england.txt'
outputfile = 'jawiki-england_category-name.txt'

# Group 1 is the category name; an optional sort key such as "|*" is skipped.
category = re.compile(r'\[\[Category:(.+?)(?:\|.*)?\]\]')

with io.open(inputfile, encoding='utf-8') as f, \
        io.open(outputfile, 'w', encoding='utf-8') as g:
    for line in f:
        m = category.match(line)
        if m:
            g.write(m.group(1) + u'\n')

#=> (output is written to the file jawiki-england_category-name.txt)

The category name is obtained with the group method of the match object. It returns the part of the line matched by the parenthesized part of the pattern (a capture group). With argument 0 the whole match is returned, and with a positive number n the text matched by the n-th group is returned; if n is larger than the number of groups, an IndexError is raised.
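The behaviour of group described above can be checked on a single sample line. This is a sketch for Python 3; the pattern has a second capture group for the sort key, added here only to show what an optional group returns.

```python
import re

category = re.compile(r'\[\[Category:(.+?)(\|.*)?\]\]')
m = category.match('[[Category:イギリス|*]]')

whole = m.group(0)     # the entire match
name = m.group(1)      # first parenthesized group
sort_key = m.group(2)  # second group; None when it did not take part

out_of_range = False
try:
    m.group(3)         # the pattern has only two groups
except IndexError:
    out_of_range = True

no_key = category.match('[[Category:島国]]')
```

Here no_key.group(2) is None because the line has no sort key, while no_key.group(1) is still the category name.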

23. Section structure

Display the section names contained in the article together with their levels (for example, 1 for "== section name ==").

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import io
import re

inputfile = 'jawiki-england.txt'
outputfile = 'jawiki-england_section.txt'

# Group 1 holds the "=" signs after the first one, so its length is the level.
# Spaces around the section name are optional.
section = re.compile(r'=(=+)\s*(.+?)\s*=+\s*$')

with io.open(inputfile, encoding='utf-8') as f, \
        io.open(outputfile, 'w', encoding='utf-8') as g:
    for line in f:
        m = section.match(line)
        if m:
            g.write(u'sec%d : %s\n' % (len(m.group(1)), m.group(2)))

#=> (output is written to the file jawiki-england_section.txt)

As in the previous problem, this uses the group method. The section level is the number of = signs on one side of the name minus one, so "== section name ==" is level 1.
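The level rule can be written as a small function. This is a sketch for Python 3, not the script above: it uses a stricter variant of the pattern in which a backreference (\1) requires the same number of = signs on both sides, and the name parse_section is made up for illustration.

```python
import re

# \1 must repeat exactly what group 1 matched, i.e. the opening "=" run.
section = re.compile(r'^(={2,})\s*(.+?)\s*\1\s*$')


def parse_section(line):
    """Return (level, name) for a heading line, or None for other lines."""
    m = section.match(line)
    if m is None:
        return None
    return len(m.group(1)) - 1, m.group(2)


print(parse_section('== 歴史 =='))   # (1, '歴史')
print(parse_section('===教育==='))   # (2, '教育')
print(parse_section('本文'))         # None
```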

24. Extracting file references

Extract all the media files referenced from the article.

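# -*- coding: utf-8 -*-
# Makes the pattern below a unicode string on Python 2 as well.
from __future__ import unicode_literals
__author__ = 'todoroki'

import io
import re

inputfile = 'jawiki-england.txt'
outputfile = 'jawiki-england_media.txt'

# A reference looks like "ファイル:name.ext|..." (or "File:" / "file:").
# group(2) is the file name up to and including its extension.
mediafile = re.compile(r".*(ファイル|File|file):(.*\.[a-zA-Z0-9]+)\|.*")

with io.open(inputfile, encoding='utf-8') as f, \
        io.open(outputfile, 'w', encoding='utf-8') as g:
    for line in f:
        m = mediafile.match(line)
        if m:
            g.write(m.group(2) + "\n")

#=> (output is written to the file jawiki-england_media.txt)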

This works the same way as the problems so far.
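One thing to keep in mind is that a line can contain more than one file reference, for example a nested link inside a caption. A pattern applied with match and a leading greedy .* reports only one of them per line, whereas finditer yields every occurrence. The following sketch, for Python 3 and on a made-up line, shows the difference; the variable names are illustrative only.

```python
import re

line = '[[ファイル:A.jpg|thumb|説明 [[File:B.png|20px]]]]'  # made-up sample

# match() with a greedy ".*" prefix: only the last reference is captured.
one_per_line = re.compile(r'.*(ファイル|File|file):(.*\.[a-zA-Z0-9]+)\|.*')
last_only = one_per_line.match(line).group(2)

# finditer() walks over every non-overlapping occurrence instead.
every = re.compile(r'(?:ファイル|File|file):(.+?\.[a-zA-Z0-9]+)\|')
names = [hit.group(1) for hit in every.finditer(line)]

print(last_only)  # B.png
print(names)      # ['A.jpg', 'B.png']
```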

I found group confusing at first, but after solving a few problems I got the hang of it.
