100 Natural Language Processing Knocks, Chapter 3: Regular Expressions (first half)

A record of solving the problems in the first half of Chapter 3. As you can see on the web page, the target file is jawiki-country.json, which is the decompressed form of jawiki-country.json.gz.

There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format. Each line stores the information for one article in JSON format. In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out as JSON. The entire file is gzip-compressed. Create a program that performs the following processing.
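The format described above (one JSON object per line, gzip-compressed) can be read roughly as follows. This is a minimal sketch rather than the solution used in this record: the helper name `find_article_text` and its parameters are hypothetical, and it assumes the file is UTF-8 encoded.

```python
import gzip
import json


def find_article_text(path, title):
    """Return the "text" of the first article whose "title" equals `title`.

    Hypothetical helper for illustration; returns None if no article matches.
    """
    # gzip.open in text mode ("rt") decompresses on the fly,
    # so the .json.gz file can be read without unpacking it first.
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines, if any
            # Each non-empty line is one self-contained JSON object.
            article = json.loads(line)
            if article.get("title") == title:
                return article.get("text")
    return None
```

For the already decompressed jawiki-country.json, the same loop should work with the built-in `open(path, encoding="utf-8")` in place of `gzip.open`. Reading line by line avoids loading the whole file into memory at once.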


20. Reading JSON data

21. Extract rows containing category names

22. Extraction of category name

23. Section structure

24. Extracting file references