100 amateur language processing knocks: 27

This is a record of my attempts at the 2015 edition of the 100 Language Processing Knocks. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). Click here for a list of past knocks (http://qiita.com/segavvy/items/fb50ba8097d59475f760).

Chapter 3: Regular Expressions

There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format:

- One article's information is stored per line in JSON format.
- On each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out as JSON.
- The entire file is gzipped.

Create a program that performs the following processing.
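The format above can be sketched with small in-memory sample data (the titles here are made up, not from the real dump):

```python
import gzip
import io
import json

# One JSON object per line, the whole thing gzipped, as described above.
articles = [
    {'title': 'England', 'text': '{{Basic information ...}}'},
    {'title': 'France', 'text': '...'},
]
raw = '\n'.join(json.dumps(a) for a in articles).encode('utf-8')
buf = io.BytesIO(gzip.compress(raw))

# Read it back line by line, the same way the finished code does.
titles = []
with gzip.open(buf, 'rt') as data_file:
    for line in data_file:
        titles.append(json.loads(line)['title'])

print(titles)  # ['England', 'France']
```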

27. Removal of internal links

In addition to the processing of problem 26, remove MediaWiki's internal link markup from the template values and convert them to plain text (Reference: Markup Quick Reference).

The finished code:

main.py


# coding: utf-8
import gzip
import json
import re
fname = 'jawiki-country.json.gz'


def extract_UK():
	'''Get the body of the article about the UK

	Return value:
	The text of the article about the UK
	'''

	with gzip.open(fname, 'rt') as data_file:
		for line in data_file:
			data_json = json.loads(line)
			if data_json['title'] == 'England':
				return data_json['text']

	raise ValueError("Can't find the article about the UK")


def remove_markup(target):
	'''Remove markup
	Removes emphasis markup and internal links.

	Arguments:
	target -- the target string
	Return value:
	The string with the markup removed
	'''

	# Removal of emphasis markup
	pattern = re.compile(r'''
		(\'{2,5})	# 2 to 5 quotes (start of markup)
		(.*?)		# any characters, non-greedy (the marked-up string)
		(\1)		# same as the first capture (end of markup)
		''', re.MULTILINE + re.VERBOSE)
	target = pattern.sub(r'\2', target)

	# Removal of internal links
	pattern = re.compile(r'''
		\[\[		# '[[' (start of markup)
		(?:			# start of a non-capturing group
			[^|]*?	# zero or more characters other than '|', non-greedy
			\|		# '|'
		)??			# end of the group; it appears 0 or 1 times, non-greedy
		([^|]*?)	# capture target: zero or more characters other than '|', non-greedy (the string to display)
		\]\]		# ']]' (end of markup)
		''', re.MULTILINE + re.VERBOSE)
	target = pattern.sub(r'\1', target)

	return target


# Compile the pattern for extracting the basic information template
pattern = re.compile(r'''
	^\{\{Basic information.*?$	# line starting with '{{Basic information'
	(.*?)		# capture target: any characters, non-greedy
	^\}\}$		# a line consisting of '}}'
	''', re.MULTILINE + re.VERBOSE + re.DOTALL)

# Extract the basic information template
contents = pattern.findall(extract_UK())

# Compile the pattern for extracting field names and values from the result
pattern = re.compile(r'''
	^\|			# line starting with '|'
	(.+?)		# capture target (field name): one or more characters, non-greedy
	\s*			# zero or more whitespace characters
	=
	\s*			# zero or more whitespace characters
	(.+?)		# capture target (value): one or more characters, non-greedy
	(?:			# start of a non-capturing group
		(?=\n\|)	# followed by a newline and '|' (positive lookahead)
		| (?=\n$)	# or followed by a newline and the end (positive lookahead)
	)			# end of the group
	''', re.MULTILINE + re.VERBOSE + re.DOTALL)

# Extract field names and values
fields = pattern.findall(contents[0])

# Store the results in a dictionary
result = {}
keys_test = []		# list of field names in order of appearance, for checking
for field in fields:
	result[field[0]] = remove_markup(field[1])
	keys_test.append(field[0])

# Display for checking (sorted by field-name appearance order using keys_test for easier reading)
for item in sorted(result.items(),
		key=lambda field: keys_test.index(field[0])):
	print(item)

Execution result:

Terminal


('Abbreviated name', 'England')
('Japanese country name', 'United Kingdom of Great Britain and Northern Ireland')
('Official country name', '{{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br/>\n*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath}}(Scottish Gaelic)<br/>\n*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon}}(Welsh)<br/>\n*{{lang|ga|Ríocht Aontaithe na Breataine Móire agus Tuaisceart na hÉireann}}(Irish)<br/>\n*{{lang|kw|An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh}}(Cornish)<br/>\n*{{lang|sco|Unitit Kinrick o Great Breetain an Northren Ireland}}(Scots)<br/>\n**{{lang|sco|Claught Kängrick o Docht Brätain an Norlin Airlann}}、{{lang|sco|Unitet Kängdom o Great Brittain an Norlin Airlann}}(Ulster Scots)</ref>')
('National flag image', 'Flag of the United Kingdom.svg')
('National emblem image', '[[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]')
('National emblem link', '(National emblem)')
('Motto', '{{lang|fr|Dieu et mon droit}}<br/>(French:God and my rights)')
('National anthem', 'God Save the Queen')
('Position image', 'Location_UK_EU_Europe_001.svg')
('Official terminology', 'English (virtually)')
('capital', 'London')
('Largest city', 'London')
('Head of state title', 'Queen')
('Name of head of state', 'Elizabeth II')
("Prime Minister's title", 'Prime Minister')
("Prime Minister's name", 'David Cameron')
('Area ranking', '76')
('Area size', '1 E11')
('Area value', '244,820')
('Water area ratio', '1.3%')
('Demographic year', '2011')
('Population ranking', '22')
('Population size', '1 E7')
('Population value', '63,181,775<ref>[http://esa.un.org/unpd/wpp/Excel-Data/population.htm United Nations Department of Economic and Social Affairs>Population Division>Data>Population>Total Population]</ref>')
('Population density value', '246')
('GDP statistics year yuan', '2012')
('GDP value source', '1,547.8 billion<ref name="imf-statistics-gdp">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a= IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom]</ref>')
('GDP Statistics Year MER', '2012')
('GDP ranking MER', '5')
('GDP value MER', '2,433.7 billion<ref name="imf-statistics-gdp" />')
('GDP statistical year', '2012')
('GDP ranking', '6')
('GDP value', '2,316.2 billion<ref name="imf-statistics-gdp" />')
('GDP/Man', '36,727<ref name="imf-statistics-gdp" />')
('Founding form', 'Founding of the country')
('Established form 1', 'Kingdom of England / Kingdom of Scotland<br />(Both countries until the Act of Union 1707)')
('Date of establishment 1', '927/843')
('Established form 2', 'Founding of the Kingdom of Great Britain<br />(Acts of Union 1707)')
('Date of establishment 2', '1707')
('Established form 3', 'United Kingdom of Great Britain and Ireland founded<br />(Acts of Union 1800)')
('Date of establishment 3', '1801')
('Established form 4', 'Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"')
('Date of establishment 4', '1927')
('currency', 'UK pounds(&pound;)')
('Currency code', 'GBP')
('Time zone', '±0')
('Daylight saving time', '+1')
('ISO 3166-1', 'GB / GBR')
('ccTLD', '.uk / .gb<ref>Use of .gb is overwhelmingly small compared to .uk.</ref>')
('International call number', '44')
('Note', '<references />')

Removing "only" internal links

remove_markup() from the previous question (http://qiita.com/segavvy/items/f6d0f3d6eee5acc33c58) has been reworked. What tripped me up this time: just when I thought it was working, I checked and found that it was also swallowing the File specification in markup like [[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]. So I changed the condition so that '|' may appear only 0 or 1 times inside the range enclosed by [[ and ]] (in the File case it appears twice), which prevents the File markup from being caught. However, this still catches categories such as [[Category:UK|*]], which were targeted in problems 21 and 22, and redirects such as #REDIRECT [[article name]] from the Markup Quick Reference, so it is a little half-baked. It happens not to matter because neither appears in the target data this time, but it may not quite answer the intent of the question...
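The behavior described above can be checked in isolation. A small sketch of the internal-link pattern, using made-up link strings:

```python
import re

# The internal-link pattern from remove_markup(): within [[ ]],
# at most one '|' is allowed, so File markup with two '|' is left alone.
pattern = re.compile(r'''
    \[\[            # '[[' (start of markup)
    (?:[^|]*?\|)??  # optional link target followed by '|', 0 or 1 times
    ([^|]*?)        # capture: the string to display
    \]\]            # ']]' (end of markup)
    ''', re.VERBOSE)

# Plain links: only the display text survives.
print(pattern.sub(r'\1', '[[London]]'))             # London
print(pattern.sub(r'\1', '[[United Kingdom|UK]]'))  # UK

# File markup contains two '|', so the pattern does not match
# and the markup survives intact.
print(pattern.sub(r'\1', '[[File:Flag.svg|85px|caption]]'))

# The caveat mentioned above: category markup with one '|' is still caught.
print(pattern.sub(r'\1', '[[Category:UK|*]]'))      # *
```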

That's all for the 28th knock. If you find any mistakes, I would appreciate it if you could point them out.

