This is my record of problem 20, "Read JSON data", from "Chapter 3: Regular expressions" of Language processing 100 knocks 2015 (.ac.jp/nlp100/#ch3).
I was reviewing something I did over a year ago, but I hardly remembered any of it. I used to google regular expressions every time I needed them, and I realized that nothing would stick unless I at least wrote up what I learned in articles.
Problem 20 is very easy: it just reads a JSON file in preparation for the regular expression tasks. I load the file with pandas, and it reminded me once again how convenient pandas is.
Link | Remarks |
---|---|
020.Read JSON data.ipynb | Answer program GitHub link |
100 amateur language processing knocks: 20 | Source from which I copied many parts of the code |
Python regular expression basics and tips to learn from scratch | I organized what I learned in this knock |
Regular expression HOWTO | Python Official Regular Expression How To |
re --- Regular expression operations | Python official re package description |
Type | Version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | I use Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |
In the above environment, I use the following additional Python package. It can be installed with plain pip.
type | version |
---|---|
pandas | 0.25.3 |
By applying regular expressions to the markup description on Wikipedia pages, various information and knowledge can be extracted.
Regular Expressions, JSON, Wikipedia, InfoBox, Web Services
There is a file, jawiki-country.json.gz, that exports Wikipedia articles in the following format.
- One article is stored per line in JSON format
- In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format
- The entire file is gzipped
Create a program that performs the following processing.
Read the JSON file of the Wikipedia article and display the article text about "UK". In problems 21-29, execute on the article text extracted here.
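from pprint import pprint
import pandas as pd
# Load the JSON Lines file (one article per line) into a DataFrame
df_wiki = pd.read_json('./jawiki-country.json', lines=True)
# The dump is Japanese Wikipedia, so the UK article is titled 'イギリス'
pprint(df_wiki[(df_wiki['title'] == 'イギリス')]['text'].values.item())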
I read the JSON file with the read_json function. It can read the JSON Lines format when True is passed to the lines parameter.
df_wiki = pd.read_json('./jawiki-country.json', lines=True)
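To make the `lines=True` behavior concrete, here is a minimal sketch on two made-up records (the sample titles and texts are hypothetical, not taken from the Wikipedia dump). It also shows that `read_json` can read a gzipped file directly, because its `compression` parameter defaults to `'infer'` and picks gzip from the `.gz` extension. The temporary file name is an assumption for illustration.

```python
import gzip
import io
import os
import tempfile

import pandas as pd

# Two hypothetical records in JSON Lines format: one JSON object per line.
sample = (
    '{"title": "A", "text": "body of A"}\n'
    '{"title": "B", "text": "body of B"}\n'
)

# lines=True makes read_json parse each line as a separate JSON object,
# producing one DataFrame row per line.
df = pd.read_json(io.StringIO(sample), lines=True)
print(df.shape)

# Write the same data gzipped, then read it back without unpacking it first.
# compression='infer' (the default) detects gzip from the ".gz" extension.
gz_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    f.write(sample)

df_gz = pd.read_json(gz_path, lines=True)
print(df_gz.equals(df))
```

Under this assumption, the original jawiki-country.json.gz could probably be passed to `read_json` as-is instead of being decompressed beforehand.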
The loaded DataFrame looks like this. The country name is stored in the title column.
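A few calls that could be used to inspect a frame like this are sketched below. The DataFrame here is a made-up stand-in with the same two columns (`title` and `text`), not the real `df_wiki`.

```python
import pandas as pd

# Hypothetical stand-in for df_wiki: same two columns, made-up contents.
df = pd.DataFrame({
    "title": ["Country A", "Country B", "Country C"],
    "text": ["body of A", "body of B", "body of C"],
})

print(df.shape)               # (number of rows, number of columns)
print(df["title"].tolist())   # all article titles as a Python list
print(df["title"].is_unique)  # True when no title appears twice
print(df.head(2))             # the first two rows
```

Checking `is_unique` is useful before looking up a single article by title, since a lookup that assumes one match fails if a title is duplicated.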
Finally, the result is printed. I use the pprint function because I wanted the output to break into lines at each newline of the text.
pprint(df_wiki[(df_wiki['title'] == 'イギリス')]['text'].values.item())
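Note that `.values.item()` on a NumPy array only succeeds when exactly one row matches the title; with zero or several matches it raises an error. As a hedged sketch of a more explicit lookup, the hypothetical helper `get_article_text` below (my own name, not part of pandas or the original answer) checks the match count first. The sample data is made up, and `pformat` with a small `width` is used only so that the short sample text gets split at its newlines the way the long article text does.

```python
from pprint import pformat

import pandas as pd

# Hypothetical stand-in for df_wiki with made-up contents.
df = pd.DataFrame({
    "title": ["A", "B"],
    "text": ["line one\nline two\nline three", "other"],
})


def get_article_text(frame, title):
    """Return the text of the single row whose title matches, else raise."""
    matches = frame.loc[frame["title"] == title, "text"]
    if len(matches) != 1:
        raise KeyError(
            f"expected exactly one article titled {title!r}, "
            f"found {len(matches)}"
        )
    return matches.iloc[0]


text = get_article_text(df, "A")

# pprint/pformat only split a string when its repr exceeds the width,
# so a narrow width is set here to trigger the split on the short sample.
formatted = pformat(text, width=20)
print(formatted)
```

With the real article text, the default width of 80 is already exceeded, which is why plain `pprint` produces the parenthesized, one-line-per-row output.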
Output result
('{{redirect|UK}}\n'
'{{Basic information Country\n'
'|Abbreviated name=England\n'
(omitted)
'[[Category:Sovereign country]]\n'
'[[Category:Island country|Kureito Furiten]]\n'
'[[Category:States / Regions Established in 1801]]')