Receive a natural number N via, e.g., a command-line argument, and split the input file into N parts line by line. Achieve the same processing with the split command.
The import and argparse setup are omitted. If the number of lines M in the file is not exactly divisible by N, the specification gives one extra line to each of the split parts in order from the first.
knock016.py
args = parser.parse_args()
N = args.line
filename = args.filename
#Split the file into N parts, line by line
f = open(filename)
lines = f.readlines()
M = len(lines)
#Quotient and remainder
quotient = M / N
remainder = M - quotient * N
#Find the lines at which to split the file
num_of_lines = [quotient + 1 if i < remainder else quotient for i in xrange(N)]
num_of_lines_cumulative = [sum(num_of_lines[:i + 1]) for i in xrange(N)]
for i, line in enumerate(lines):
    if i in num_of_lines_cumulative:
        #Insert a blank line at each split boundary
        print
        print line.strip()
    else:
        print line.strip()
f.close()
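For reference, the same chunk-size logic can be packaged as a self-contained helper that returns the N chunks instead of printing them. This is a minimal sketch; the `split_file` name and the return-a-list design are my own, not from the original:

```python
def split_file(lines, n):
    """Split lines into n chunks; the first (len(lines) % n) chunks get one extra line."""
    quotient, remainder = divmod(len(lines), n)
    chunks = []
    start = 0
    for i in range(n):
        # the first `remainder` chunks are one line longer
        size = quotient + 1 if i < remainder else quotient
        chunks.append(lines[start:start + size])
        start += size
    return chunks

# 5 lines into 3 parts -> chunk sizes 2, 2, 1
chunks = split_file(["a\n", "b\n", "c\n", "d\n", "e\n"], 3)
```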
UNIX command... After adding some option validation (though still not enough), the code got longer.
knock016.sh
#!/bin/sh
#Receive the natural number N by means such as command line arguments, and divide the input file into N line by line.
#Achieve the same processing with the split command.
# ex.
# sh knock016.sh -f hightemp.txt -n 7
while getopts f:n: OPT
do
    case $OPT in
        "f" ) FLG_F="TRUE" ; INPUT_FILE=$OPTARG ;;
        "n" ) FLG_N="TRUE" ; N=$OPTARG ;;
        * ) echo "Usage: $0 [-f file name] [-n split number]" 1>&2
            exit 1 ;;
    esac
done
if [ ! "$FLG_F" = "TRUE" ]; then
    echo 'file name is not set.'
    exit 1
fi
if [ ! "$FLG_N" = "TRUE" ]; then
    echo 'split number is not set.'
    exit 1
fi
#The split/ output directory must exist; create it if necessary
mkdir -p split
TMP_HEAD="split/tmphead.$INPUT_FILE"
TMP_TAIL="split/tmptail.$INPUT_FILE"
SPLITHEAD_PREFIX="split/splithead."
SPLITTAIL_PREFIX="split/splittail."
M=$( wc -l < $INPUT_FILE )
quotient=`expr $M / $N`
remainder=`expr $M - $quotient \* $N`
if [ $quotient -eq 0 ]; then
    echo "cannot divide: N is larger than the number of lines in the input file."
    exit 1
fi
if [ $remainder -eq 0 ]; then
    #If the remainder is 0, split so that each file contains $quotient lines
    split -l $quotient $INPUT_FILE $SPLITHEAD_PREFIX
else
    #If the remainder is non-zero, divide the file into two parts:
    # (a) the first (($quotient + 1) * $remainder) lines, and (b) the rest
    split_head=`expr \( $quotient + 1 \) \* $remainder`
    split_tail=`expr $M - $split_head`
    head -n $split_head $INPUT_FILE > $TMP_HEAD
    tail -n $split_tail $INPUT_FILE > $TMP_TAIL
    #Split so that each file from (a) contains ($quotient + 1) lines,
    #and each file from (b) contains $quotient lines
    split -l `expr $quotient + 1` $TMP_HEAD $SPLITHEAD_PREFIX
    split -l $quotient $TMP_TAIL $SPLITTAIL_PREFIX
    rm -iv split/tmp*
fi
Since split is a command used by specifying the number of lines contained in one output file, it took a little ingenuity.
Find the set of distinct strings in the first column. Use the sort and uniq commands for confirmation.
python
if __name__ == '__main__':
    f = open(filename)
    lines = f.readlines()
    # unlike problem 12, "+ '\n'" is not necessary
    content_col1 = [line.split()[0] for line in lines]
    content_col1_set = set(content_col1)
    print len(content_col1_set)
    for x in content_col1_set:
        print x
    f.close()
#>>>
#12
#Aichi prefecture
#Yamagata Prefecture
#Gifu Prefecture
#Chiba
#Saitama
#Kochi Prefecture
#Gunma Prefecture
#Yamanashi Prefecture
#Wakayama Prefecture
#Ehime Prefecture
#Osaka
#Shizuoka Prefecture
UNIX command. The output doesn't have to be in the same order, does it...?
sh
awk -F'\t' '{print $1;}' hightemp.txt | sort | uniq
#>>>
#Chiba
#Wakayama Prefecture
#Saitama
#Osaka
#Yamagata Prefecture
#Yamanashi Prefecture
#Gifu Prefecture
#Ehime Prefecture
#Aichi prefecture
#Gunma Prefecture
#Shizuoka Prefecture
#Kochi Prefecture
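On the ordering question above: `sort | uniq` prints the distinct values in sorted order, so sorting the Python set should reproduce it (for Japanese text, the exact order also depends on the locale/encoding sort uses). A small sketch with made-up column values:

```python
col1 = ["Chiba", "Saitama", "Chiba", "Osaka", "Saitama"]
# set() gives the distinct values, but in arbitrary order
distinct = set(col1)
# sorting the set matches what `sort | uniq` would print
print(sorted(distinct))  # ['Chiba', 'Osaka', 'Saitama']
```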
Sort the rows in descending order by the numbers in the third column (note: rearrange the rows without changing their contents). Use the sort command for confirmation (the result of this problem does not have to match the output of the command exactly).
python
if __name__ == '__main__':
    f = open(filename)
    lines = f.readlines()
    # reverse=True gives a descending sort
    sorted_lines = sorted(lines, key=lambda line: float(line.split()[2]), reverse=True)
    for sorted_line in sorted_lines:
        print sorted_line,
    f.close()
#>>>
#Kochi Prefecture Ekawasaki 41 2013-08-12
#Saitama Prefecture Kumagaya 40.9 2007-08-16
#Gifu Prefecture Tajimi 40.9 2007-08-16
#Yamagata Prefecture Yamagata 40.8 1933-07-25
#Yamanashi Prefecture Kofu 40.7 2013-08-10
#Wakayama Prefecture Katsuragi 40.6 1994-08-08
#Shizuoka Prefecture Tenryu 40.6 1994-08-04
#Yamanashi Prefecture Katsunuma 40.5 2013-08-10
#Saitama Prefecture Koshigaya 40.4 2007-08-16
#Gunma Prefecture Tatebayashi 40.3 2007-08-16
#Gunma Prefecture Kamisatomi 40.3 1998-07-04
#Aichi Prefecture Aisai 40.3 1994-08-05
#Chiba Prefecture Ushiku 40.2 2004-07-20
#Shizuoka Prefecture Sakuma 40.2 2001-07-24
#Ehime Prefecture Uwajima 40.2 1927-07-22
#Yamagata Prefecture Sakata 40.1 1978-08-03
#Gifu Prefecture Mino 40 2007-08-16
#Gunma Prefecture Maebashi 40 2001-07-24
#Chiba Prefecture Mobara 39.9 2013-08-11
#Saitama Prefecture Hatoyama 39.9 1997-07-05
#Osaka Toyonaka 39.9 1994-08-08
#Yamanashi Prefecture Otsuki 39.9 1990-07-19
#Yamagata Prefecture Tsuruoka 39.9 1978-08-03
#Aichi Prefecture Nagoya 39.9 1942-08-02
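The `key=lambda line: float(...)` part matters: in this dataset all values happen to be 39.x–41, but in general, comparing the numbers as strings mis-orders values with different digit counts. A minimal illustration (toy values, not from the original):

```python
temps = ["9.5", "40", "10.2"]

# plain string sort compares character by character: "9..." > "4..." > "1..."
print(sorted(temps, reverse=True))             # ['9.5', '40', '10.2']

# converting to float compares numerically
print(sorted(temps, key=float, reverse=True))  # ['40', '10.2', '9.5']
```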
UNIX command.
sh
sort -k3r hightemp.txt
Specify the key column with the -k option; adding r reverses the order. (Note this compares the field as a string; for a strict numeric sort, add n as well, e.g. sort -k3,3nr.)
Find the frequency of occurrence of the strings in the first column of each line, and display them in descending order of frequency. Use the cut, uniq, and sort commands for confirmation.
python
from collections import defaultdict
from collections import Counter
...
if __name__ == '__main__':
    f = open(filename)
    lines = f.readlines()
    # extract 1st column
    content_col1 = [line.split()[0] for line in lines]
    # (1) defaultdict
    # http://docs.python.jp/2/library/collections.html#collections.defaultdict
    d = defaultdict(int)
    for col1 in content_col1:
        d[col1] += 1
    for word, cnt in sorted(d.items(), key=lambda x: x[1], reverse=True):
        print word, cnt
    print
    # (2) Counter
    # http://docs.python.jp/2/library/collections.html#collections.Counter
    counter = Counter(content_col1)
    for word, cnt in counter.most_common():
        print word, cnt
    f.close()
#>>>
#Yamagata Prefecture 3
#Saitama Prefecture 3
#Gunma Prefecture 3
#Yamanashi 3
#Aichi 2
#Gifu prefecture 2
#Chiba 2
#Shizuoka Prefecture 2
#Kochi Prefecture 1
#Wakayama Prefecture 1
#Ehime Prefecture 1
#Osaka 1
#Yamagata Prefecture 3
#Saitama Prefecture 3
#Gunma Prefecture 3
#Yamanashi 3
#Aichi 2
#Gifu prefecture 2
#Chiba 2
#Shizuoka Prefecture 2
#Kochi Prefecture 1
#Wakayama Prefecture 1
#Ehime Prefecture 1
#Osaka 1
Do you count with a defaultdict as in (1), or use Counter itself as in (2)? Counter comes with the handy most_common() method...
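A side-by-side sketch of the two approaches from the code above, with toy data instead of the file contents:

```python
from collections import Counter, defaultdict

words = ["Gunma", "Chiba", "Gunma", "Osaka", "Gunma", "Chiba"]

# (1) defaultdict(int): missing keys default to 0, so += 1 just works
d = defaultdict(int)
for w in words:
    d[w] += 1

# (2) Counter does the counting and the descending sort in one place
counter = Counter(words)
print(counter.most_common())  # [('Gunma', 3), ('Chiba', 2), ('Osaka', 1)]
```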
Then UNIX command.
sh
cut -f 1 hightemp.txt | sort | uniq -c | sort -nr
#>>>
#3 Gunma Prefecture
#3 Yamanashi Prefecture
#3 Yamagata Prefecture
#3 Saitama Prefecture
#2 Shizuoka Prefecture
#2 Aichi prefecture
#2 Gifu Prefecture
#2 Chiba
#1 Kochi prefecture
#1 Ehime prefecture
#1 Osaka
#1 Wakayama Prefecture
It's an idiom-like pipeline that I use often, so I want to remember it well: sort brings identical lines next to each other, uniq merges adjacent duplicate lines, its -c option counts those duplicates, and "sort -nr" then sorts the rows as numbers in descending order.
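The reason for the first sort is that uniq only merges *adjacent* duplicates. Python's itertools.groupby behaves the same way, which makes the point easy to see (a sketch with toy data):

```python
from itertools import groupby

data = ["b", "a", "b", "a", "a"]

# like `uniq -c` on unsorted input: only adjacent runs are merged
unsorted_counts = [(k, len(list(g))) for k, g in groupby(data)]
print(unsorted_counts)  # [('b', 1), ('a', 1), ('b', 1), ('a', 2)]

# like `sort | uniq -c`: sorting first brings all duplicates together
sorted_counts = [(k, len(list(g))) for k, g in groupby(sorted(data))]
print(sorted_counts)    # [('a', 3), ('b', 2)]
```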