An amateur's 100 language processing knocks: 19

This is a record of my attempt at Language Processing 100 Knocks 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). A list of past knocks is available here (http://qiita.com/segavvy/items/fb50ba8097d59475f760).

Chapter 2: UNIX Command Basics

hightemp.txt is a file that stores records of the highest temperatures in Japan in a tab-delimited format with the columns "prefecture", "location", "℃", and "date". Create programs that perform the following processing, using hightemp.txt as the input file. In addition, perform the same processing with UNIX commands and check the program's results against them.

19. Find the frequency of appearance of the character string in the first column of each line, and arrange them in descending order of frequency of appearance.

Find the frequency of occurrence of each string in the first column of each line, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands for confirmation.

The finished code:

`main.py`


# coding: utf-8
from itertools import groupby

fname = 'hightemp.txt'

# Read the prefecture names (first column of each line)
with open(fname, encoding='utf-8') as data_file:
    kens = [line.split('\t')[0] for line in data_file]

# Aggregate by prefecture into a list of (prefecture, frequency) tuples
kens.sort()    # groupby() expects its input to be sorted
result = [(ken, len(list(group))) for ken, group in groupby(kens)]

# Sort by frequency, in descending order
result.sort(key=lambda item: item[1], reverse=True)

# Output the result
for ken, count in result:
    print('{ken}({count})'.format(ken=ken, count=count))

Execution result:

`Terminal`


Saitama Prefecture(3)
Yamagata Prefecture(3)
Yamanashi Prefecture(3)
Gunma Prefecture(3)
Chiba Prefecture(2)
Gifu Prefecture(2)
Aichi Prefecture(2)
Shizuoka Prefecture(2)
Wakayama Prefecture(1)
Osaka Prefecture(1)
Ehime Prefecture(1)
Kochi Prefecture(1)

Shell script for UNIX command confirmation:

`test.sh`


#!/bin/sh

# Cut out the 1st column, sort it, deduplicate with counts, then sort the result in descending order
cut --fields=1 hightemp.txt | sort | uniq --count | sort --reverse

Confirmation of results:

`Terminal`


segavvy@ubuntu:~/document/100 language processing knock 2015/19$ ./test.sh
3 Yamanashi Prefecture
3 Yamagata Prefecture
3 Saitama Prefecture
3 Gunma Prefecture
2 Chiba Prefecture
2 Shizuoka Prefecture
2 Gifu Prefecture
2 Aichi Prefecture
1 Wakayama Prefecture
1 Osaka Prefecture
1 Kochi Prefecture
1 Ehime Prefecture

As with the previous problem, the order of entries with the same frequency differs between the two outputs, so instead of working to match the formats and run diff, I checked the results visually.
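If a mechanical check is preferred over a visual one, one option is to normalise both outputs into (prefecture, count) pairs and compare them as sets, which ignores the order of ties. The following is only a minimal sketch and not part of the original solution; the helper names `parse_program_output` and `parse_uniq_output` and the sample strings are assumptions made for illustration.

```python
import re


def parse_program_output(text):
    """Parse lines of the form 'name(count)' into a set of (name, count)."""
    pairs = set()
    for line in text.splitlines():
        match = re.fullmatch(r'(.+)\((\d+)\)', line.strip())
        if match:
            pairs.add((match.group(1), int(match.group(2))))
    return pairs


def parse_uniq_output(text):
    """Parse 'uniq --count' style lines ('  count name') the same way."""
    pairs = set()
    for line in text.splitlines():
        if not line.strip():
            continue
        count, name = line.strip().split(None, 1)
        pairs.add((name, int(count)))
    return pairs


# Hypothetical sample outputs; the ties appear in a different order
program_output = 'A(2)\nB(2)\nC(1)\n'
uniq_output = '      2 B\n      2 A\n      1 C\n'

same = parse_program_output(program_output) == parse_uniq_output(uniq_output)
print(same)
```

Because sets are unordered, this comparison only confirms that the same prefectures carry the same counts; it says nothing about the display order.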

List comprehensions

This time I tried using a list comprehension for the first time. I have more or less reached the point where I can read and write the parts that build a list.
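For comparison, the step that builds the list of prefectures can be written either as an explicit loop or as a comprehension. This is a small sketch with hypothetical sample lines (the names `kens_loop` and `kens_comp` and the data are assumptions for illustration), showing that both forms produce the same list.

```python
# Hypothetical sample lines in the same tab-delimited layout as hightemp.txt
lines = [
    'Kochi\tEkawasaki\t41\t2013-08-12\n',
    'Saitama\tKumagaya\t40.9\t2007-08-16\n',
]

# Explicit loop: append the first column of each line
kens_loop = []
for line in lines:
    kens_loop.append(line.split('\t')[0])

# Equivalent list comprehension
kens_comp = [line.split('\t')[0] for line in lines]

print(kens_loop == kens_comp)  # True
print(kens_comp)               # ['Kochi', 'Saitama']
```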

Aggregation processing

For the aggregation by prefecture I used [itertools.groupby()](http://docs.python.jp/3/library/itertools.html#itertools.groupby). Note that, like the UNIX `uniq` command used for confirmation, it only merges adjacent equal items, so the input must be sorted beforehand.
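The effect of the missing sort is easy to see on a tiny example. The sketch below uses hypothetical single-letter keys in place of prefecture names; it also shows `collections.Counter`, which is an alternative to the approach used in this article (not what the code above does) and needs no prior sorting.

```python
from collections import Counter
from itertools import groupby

# Hypothetical first-column values, deliberately unsorted
kens = ['A', 'B', 'A', 'C', 'B', 'A']

# Without sorting, groupby only merges adjacent equal items
unsorted_counts = [(k, len(list(g))) for k, g in groupby(kens)]
print(unsorted_counts)
# [('A', 1), ('B', 1), ('A', 1), ('C', 1), ('B', 1), ('A', 1)]

# After sorting, each key forms exactly one group
sorted_counts = [(k, len(list(g))) for k, g in groupby(sorted(kens))]
print(sorted_counts)  # [('A', 3), ('B', 2), ('C', 1)]

# Counter tallies without any prior sorting
counter_counts = Counter(kens).most_common()
print(counter_counts)  # [('A', 3), ('B', 2), ('C', 1)]
```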

Misinterpretation of the problem

Actually, at first I misunderstood the problem and read it as:

Find the frequency of occurrence of the string in the first column of each line, and display [each line] in descending order of that frequency.

Since I had gone to the trouble of writing it, I'll post that code and its result as well.

Completed code (misunderstanding version):

`main2.py`


# coding: utf-8
from itertools import groupby

fname = 'hightemp.txt'


def get_ken(target):
    '''Cut out the prefecture part from one line of data.

    Arguments:
    target -- one line of data
    Return value:
    the prefecture string
    '''
    return target.split('\t')[0]


# Read all lines
with open(fname, encoding='utf-8') as data_file:
    lines = data_file.readlines()

# Aggregate by prefecture
lines.sort(key=get_ken)    # groupby() expects its input to be sorted
groups = groupby(lines, key=get_ken)

# Convert the aggregate into a list of (prefecture, frequency, matching lines)
result = []
for ken, group in groups:
    group_lines = list(group)
    result.append((ken, len(group_lines), group_lines))

# Sort by frequency, in descending order
result.sort(key=lambda item: item[1], reverse=True)

# Display the result
for _, _, group_lines in result:
    for line in group_lines:
        print(line, end='')

`get_ken()` is a function that extracts the prefecture from the first column. Since it is used in two places, `sort()` and `groupby()`, I made it a named function instead of a lambda expression.
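Sharing one key function between the sort and the grouping ensures that both steps agree on what counts as the same prefecture. As a side note, Python's sort is stable, so lines with the same key keep their original file order within each group. The sketch below illustrates this with hypothetical rows; the name `first_column` and the sample data are assumptions for illustration, not taken from the code above.

```python
from itertools import groupby


def first_column(row):
    """Return the first tab-delimited field of a row."""
    return row.split('\t')[0]


# Hypothetical rows in a hightemp.txt-like layout
rows = [
    'B\tp1\t40.9\n',
    'A\tp2\t40.8\n',
    'B\tp3\t40.4\n',
    'A\tp4\t40.1\n',
    'C\tp5\t39.9\n',
]

# sorted() is stable: rows with the same key keep their original order
ordered = sorted(rows, key=first_column)

# The same key function drives the grouping
grouped = [(ken, [r.split('\t')[1] for r in group])
           for ken, group in groupby(ordered, key=first_column)]
print(grouped)  # [('A', ['p2', 'p4']), ('B', ['p1', 'p3']), ('C', ['p5'])]
```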

Execution result (misunderstanding version):

`Terminal`


Saitama Prefecture	Kumagaya	40.9	2007-08-16
Saitama Prefecture	Koshigaya	40.4	2007-08-16
Saitama Prefecture	Hatoyama	39.9	1997-07-05
Yamagata Prefecture	Yamagata	40.8	1933-07-25
Yamagata Prefecture	Sakata	40.1	1978-08-03
Yamagata Prefecture	Tsuruoka	39.9	1978-08-03
Yamanashi Prefecture	Kofu	40.7	2013-08-10
Yamanashi Prefecture	Katsunuma	40.5	2013-08-10
Yamanashi Prefecture	Otsuki	39.9	1990-07-19
Gunma Prefecture	Tatebayashi	40.3	2007-08-16
Gunma Prefecture	Kamisatomi	40.3	1998-07-04
Gunma Prefecture	Maebashi	40	2001-07-24
Chiba Prefecture	Ushiku	40.2	2004-07-20
Chiba Prefecture	Mobara	39.9	2013-08-11
Gifu Prefecture	Tajimi	40.9	2007-08-16
Gifu Prefecture	Mino	40	2007-08-16
Aichi Prefecture	Aisai	40.3	1994-08-05
Aichi Prefecture	Nagoya	39.9	1942-08-02
Shizuoka Prefecture	Tenryu	40.6	1994-08-04
Shizuoka Prefecture	Sakuma	40.2	2001-07-24
Wakayama Prefecture	Katsuragi	40.6	1994-08-08
Osaka Prefecture	Toyonaka	39.9	1994-08-08
Ehime Prefecture	Uwajima	40.2	1927-07-22
Kochi Prefecture	Ekawasaki	41	2013-08-12

That's all for the 20th knock. If you find any mistakes, I would appreciate it if you could point them out.