100 language processing knock-20 (using pandas): reading JSON data

Language processing 100 knocks 2015 "Chapter 3: Regular expressions" This is the record of 20th "Read JSON data" of .ac.jp/nlp100/#ch3). It was a review of what I did over a year ago, but I hardly remembered it. I used to google regular expressions every time I needed them, but I realized that it would be meaningless to me if I didn't output them to articles at least. The 20th one is very easy to read a JSON file in preparation for a regular expression task. I'm loading using pandas, but I realize the convenience of pandas again.

Reference link

Link Remarks
020.Read JSON data.ipynb Answer program GitHub link
100 amateur language processing knocks:20 Copy and paste source of many source parts
Python regular expression basics and tips to learn from scratch I organized what I learned in this knock
Regular expression HOWTO Python Official Regular Expression How To
re ---Regular expression operation Python official re package description

environment

type version Contents
OS Ubuntu18.04.01 LTS It is running virtually
pyenv 1.2.15 I use pyenv because I sometimes use multiple Python environments
Python 3.6.9 python3 on pyenv.6.I'm using 9
3.7 or 3.There is no deep reason not to use 8 series
Packages are managed using venv

In the above environment, I am using the following additional Python packages. Just install with regular pip.

type version
pandas 0.25.3

Chapter 3: Regular Expressions

content of study

By applying regular expressions to the markup description on Wikipedia pages, various information and knowledge can be extracted.

Regular Expressions, JSON, Wikipedia, InfoBox, Web Services

Knock content

File jawiki-country.json.gz that exports Wikipedia articles in the following format There is.

--One article information per line is stored in JSON format --In each line, the article name is stored in the "title" key and the article body is stored in the dictionary object with the "text" key, and that object is written out in JSON format. --The entire file is gzipped

Create a program that performs the following processing.

20. Reading JSON data

Read the JSON file of the Wikipedia article and display the article text about "UK". In problems 21-29, execute on the article text extracted here.

Answer

Answer program [020.Read JSON data.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/03.%E6%AD%A3%E8%A6%8F%E8%A1%A8% E7% 8F% BE / 020.JSON% E3% 83% 87% E3% 83% BC% E3% 82% BF% E3% 81% AE% E8% AA% AD% E3% 81% BF% E8% BE% BC% E3% 81% BF.ipynb)

from pprint import pprint

import pandas as pd

df_wiki = pd.read_json('./jawiki-country.json', lines=True)
pprint(df_wiki[(df_wiki['title'] == 'England')]['text'].values.item())

Answer commentary

I'm reading a JSON file with the read_json function. You can read the format [JSON Lines](JSON Lines) by passing True to the lines parameter.

df_wiki = pd.read_json('./jawiki-country.json', lines=True)

The loaded DataFrame looks like this. The country name is included in title. image.png

The result is output at the end. I'm using the pprint function because I wanted to start a new line.

pprint(df_wiki[(df_wiki['title'] == 'England')]['text'].values.item())

Output result


('{{redirect|UK}}\n'
 '{{Basic information Country\n'
 '|Abbreviated name=England\n'

Omission

 '[[Category:Sovereign country]]\n'
 '[[Category:Island country|Kureito Furiten]]\n'
 '[[Category:States / Regions Established in 1801]]')

Recommended Posts

100 language processing knock-20 (using pandas): reading JSON data
100 Language Processing Knock-31 (using pandas): Verb
100 Language Processing Knock-38 (using pandas): Histogram
100 language processing knock-30 (using pandas): reading morphological analysis results
100 Language Processing Knock-33 (using pandas): Sahen noun
100 Language Processing Knock-35 (using pandas): Noun concatenation
100 Language Processing Knock-39 (using pandas): Zipf's Law
100 Language Processing Knock-34 (using pandas): "A B"
100 Language Processing Knock-32 (using pandas): Prototype of verb
100 language processing knock-98 (using pandas): Ward's method clustering
100 language processing knock-99 (using pandas): visualization by t-SNE
100 language processing knock-95 (using pandas): Rating with WordSimilarity-353
100 language processing knock-92 (using Gensim): application to analogy data
100 Language Processing Knock-36 (using pandas): Frequency of word occurrence
100 Language Processing Knock: Chapter 2 UNIX Command Basics (using pandas)
100 Language Processing Knock-83 (using pandas): Measuring word / context frequency
100 language processing knock-76 (using scikit-learn): labeling
100 language processing knock-73 (using scikit-learn): learning
100 language processing knock-74 (using scikit-learn): Prediction
100 Language Processing Knock-84 (using pandas): Creating a word context matrix
100 Language Processing Knock-70 (using Stanford NLP): Obtaining and shaping data
100 Language Processing Knock (2020): 28
100 Language Processing Knock (2020): 38
100 language processing knock 00 ~ 02
100 language processing knock-97 (using scikit-learn): k-means clustering
100 Language Processing Knock-91: Preparation of Analogy Data
100 Language Processing Knock-71 (using Stanford NLP): Stopword
100 language processing knock 2020 [00 ~ 39 answer]
100 language processing knock 2020 [00-79 answer]
100 language processing knock-90 (using Gensim): learning with word2vec
100 Language Processing Knock 2020 Chapter 1
100 Amateur Language Processing Knock: 17
100 language processing knock 2020 [00 ~ 49 answer]
100 Language Processing Knock-93 (using pandas): Calculate the accuracy rate of analogy tasks
100 Language Processing Knock-52: Stemming
100 language processing knock-79 (using scikit-learn): precision-recall graph drawing
100 Language Processing Knock Chapter 1
100 Language Processing Knock-41: Reading Parsing Results (Phrase / Dependency)
100 Amateur Language Processing Knock: 07
[Pandas] Basics of processing date data using dt
100 Language Processing Knock 2020 Chapter 3
100 Language Processing Knock 2020 Chapter 2
100 language processing knock-75 (using scikit-learn): weight of features
100 Amateur Language Processing Knock: 09
100 Amateur Language Processing Knock: 47
100 Language Processing Knock-53: Tokenization
100 Amateur Language Processing Knock: 97
100 language processing knock 2020 [00 ~ 59 answer]
100 Amateur Language Processing Knock: 67
100 language processing knock-72 (using Stanford NLP): feature extraction
Process csv data with python (count processing using pandas)
100 language processing knock-94 (using Gensim): similarity calculation with WordSimilarity-353
100 language processing knocks-37 (using pandas): Top 10 most frequent words
100 Language Processing with Python Knock 2015
100 Language Processing Knock-51: Word Clipping
100 Language Processing Knock-58: Tuple Extraction
100 Language Processing Knock-57: Dependency Analysis
100 language processing knock-50: sentence break
100 Language Processing Knock Chapter 1 (Python)
100 Language Processing Knock Chapter 2 (Python)
100 Language Processing Knock-25: Template Extraction