On Linux there is a command called `tail` that returns the last n lines of a file. It's quite convenient, so I want to do the same thing in Python: create a function that retrieves the last n lines of a file as `tail(file_name, n)`, using several approaches.
For the last approach, I referred to a page on it-swarm.dev, "Efficiently find the last line of a text file" (linked below).
The file to be read could be any text file, but this time I will use a CSV file named `test.csv`. Its content is a record of Bitcoin prices, one line per second, for 86,400 lines (one day).
test.csv

```
date,price,size
1588258800,933239.0,3.91528007
1588258801,933103.0,3.91169431
1588258802,932838.0,2.91
1588258803,933217.0,0.5089811
(omitted)
1588345195,955028.0,0.0
1588345196,954959.0,0.05553
1588345197,954984.0,1.85356
1588345198,955389.0,10.91445135
1588345199,955224.0,3.61106
```
Although it has nothing to do with the main subject, a quick note on each column: the units of date, price, and size are Unix time, JPY, and BTC respectively. The first line means that at time 1588258800, that is, at 0:00:00 on May 1st, 3.91528007 BTC was traded at 933,239.0 yen.
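Incidentally, a Unix time like this can be checked with the standard `datetime` module. A minimal sketch (the `+9:00` offset is my assumption that the timestamps are in JST):

```python
from datetime import datetime, timezone, timedelta

# Assume the timestamps are in JST (UTC+9)
jst = timezone(timedelta(hours=9))
dt = datetime.fromtimestamp(1588258800, jst)
print(dt.strftime('%Y-%m-%d %H:%M:%S'))  # 2020-05-01 00:00:00
```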
First, use the built-in function `open()` to get a file object, read all the lines from the beginning, and output only the last n lines. If n is 0 or a negative integer this gives strange results, so strictly speaking n should be restricted to natural numbers, but I'll keep the code simple for readability.
```python
def tail(fn, n):
    # Open the file and read all lines into a list
    with open(fn, 'r') as f:
        # Read one line; the first line is the header, so discard it
        f.readline()
        # Read all remaining lines
        lines = f.readlines()
    # Return only the last n lines
    return lines[-n:]

# Result
file_name = 'test.csv'
tail(file_name, 3)
# ['1588345197,954984.0,1.85356\n',
#  '1588345198,955389.0,10.91445135\n',
#  '1588345199,955224.0,3.61106\n']
```
For a plain text file you could leave it at that, but let's make it a little more convenient for CSV files.
```python
def tail(fn, n):
    # Open the file and read all lines into a list
    with open(fn, 'r') as f:
        # Discard the header line
        f.readline()
        lines = f.readlines()
    # Return each line as a list of values, converting str -> float along the way
    return [list(map(float, line.strip().split(','))) for line in lines[-n:]]

# Result
tail(file_name, 3)
# [[1588345197.0, 954984.0, 1.85356],
#  [1588345198.0, 955389.0, 10.91445135],
#  [1588345199.0, 955224.0, 3.61106]]
```
The only line that has changed is the `return` line, but it chains several functions and is hard to read at a glance, so let's break it down. The following processing is applied to each line.

`strip()`: `'1588345197,954984.0,1.85356\n'` -> `'1588345197,954984.0,1.85356'`

`split(',')`: `'1588345197,954984.0,1.85356'` -> `['1588345197', '954984.0', '1.85356']`

`map(float, ...)`: `['1588345197', '954984.0', '1.85356']` -> `[1588345197.0, 954984.0, 1.85356]`
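Putting the three steps together, the chain can be checked on a single sample line:

```python
line = '1588345197,954984.0,1.85356\n'
stripped = line.strip()           # remove the trailing newline
parts = stripped.split(',')       # split into string fields
values = list(map(float, parts))  # convert each field to float
print(values)  # [1588345197.0, 954984.0, 1.85356]
```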
Since the csv module automatically converts each line to a list, the processing is a little slower, but the code is more concise.
```python
import csv

def tail_csv(fn, n):
    with open(fn) as f:
        # Wrap the file object in a csv reader
        reader = csv.reader(f)
        # Discard the header
        next(reader)
        # Read all rows
        rows = [row for row in reader]
    # Convert only the last n rows to float and return them
    return [list(map(float, row)) for row in rows[-n:]]
```
pandas has a `tail` method, so this version is surprisingly short.
```python
import pandas as pd

def tail_pd(fn, n):
    df = pd.read_csv(fn)
    return df.tail(n).values.tolist()
```
Since pandas stores its data as numpy arrays, `tolist()` converts the result to a plain list at the end. That step is unnecessary if a numpy array is fine for your purposes.

IPython has a convenient magic command called `timeit`, so let's compare the three with the loop count set to 100.
```
timeit -n100 tail('test.csv', 3)
18.8 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit -n100 tail_csv('test.csv', 3)
67 ms ± 822 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit -n100 tail_pd('test.csv', 3)
30.4 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
It turns out that reading the file directly, without any module, is fastest. pandas seems to offer the best cost-performance: the code is the simplest and the speed is reasonable. The csv module converts every single line from a string to a list, including the lines we never use, so its result is by far the worst.
All of the approaches so far read every line of the file. But we only want the last few lines, so if there were a way to read the file from the end, reading should finish in an instant.

I referred to the it-swarm.dev page "Efficiently find the last line of a text file" (www.it-swarm.dev/ja/python/…/940298444/). The idea is to read about 100 bytes at a time from the end; once a newline is found, the string after it is the last line. That page only finds the last line, but to implement the `tail` command we need the n-th line from the end, so only that part is adjusted.
First, as background, let's look at how to move a file pointer. There are three functions to use: `f.tell()`, `f.read(size)`, and `f.seek(offset, whence)`.

`f.tell()` returns the current position of the pointer.

`f.read(size)` reads and returns `size` bytes from the current position. The pointer advances by the amount read; it can only move forward.

`f.seek(offset, whence)` moves the pointer. The argument `whence` specifies the reference position and takes one of the values 0, 1, 2: 0 is the beginning of the file, 1 is the current pointer position, and 2 is the end of the file. `offset` is an integer; unlike `read`, it may be negative, so for example `f.seek(-15, 1)` moves the pointer 15 bytes back from the current position.
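As a quick check, here is a minimal demo of these three calls on a throwaway file (the file itself is just for illustration):

```python
import tempfile

# Create a small throwaway file to experiment on
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'hello\nworld\n')
    path = tmp.name

# seek()/tell() behave most predictably in binary mode
with open(path, 'rb') as f:
    f.seek(0, 2)       # move to the end of the file (whence=2)
    size = f.tell()    # current position = file size = 12
    f.seek(-6, 2)      # 6 bytes back from the end
    chunk = f.read(6)  # the pointer advances as it reads
print(size, chunk)  # 12 b'world\n'
```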
We will implement it based on these.
python
#Use split that can use regular expressions
import re
def tail_b(fn, n=None):
#If n is not given, only the last line is returned alone.
if n is None:
n = 1
is_list = False
#n is a natural number
elif type(n) != int or n < 1:
raise ValueError('n has to be a positive integer')
#When n is given, n rows are returned together in a list.
else:
is_list = True
# 128 *Read n bytes at a time
chunk_size = 64 * n
# seek()Behaves unexpectedly except in binary mode'rb'To specify
with open(fn, 'rb') as f:
#First line to find the leftmost position excluding the header(Header line)I Read
f.readline()
#The very first line feed code is at the left end(End when reading from the end of the file)To
# -1 is'\n'1 byte
left_end = f.tell() - 1
#End of file(2)1 byte back from. read(1)To read in
f.seek(-1, 2)
#Because there are often blank lines and spaces at the end of the file
#Position of the last character in the file excluding them(Right end)Find
while True:
if f.read(1).strip() != b'':
#Right end
right_end = f.tell()
break
#Take one step, so take two steps down
f.seek(-2, 1)
#Number of bytes remaining unread to the far left
unread = right_end - left_end
#Number of lines read.If this becomes n or more, it means that n lines have been read.
num_lines = 0
#Variable for connecting the read byte strings
line = b''
while True:
#The number of unread bytes is chunk_When it becomes smaller than size,Chunk fraction_size
if unread < chunk_size:
chunk_size = f.tell() - left_end
#Chunk from your current location_Move to the top of the file by size
f.seek(-chunk_size, 1)
#Read only the amount you moved
chunk = f.read(chunk_size)
#Connect
line = chunk + line
#Since I proceeded again with read, chunk again at the beginning_size move
f.seek(-chunk_size, 1)
#Update the number of unread bytes
unread -= chunk_size
#If a line feed code is included
if b'\n' in chunk:
#Num for the number of line feed codes_Count up lines
num_lines += chunk.count(b'\n')
#Read more than n lines,Or when the number of unread bytes reaches 0, a signal to end
if num_lines >= n or not unread:
#Last found line feed code
leftmost_blank = re.search(rb'\r?\n', line)
#The part before the line feed code found last is unnecessary
line = line[leftmost_blank.end():]
#Convert byte string to string
line = line.decode()
#Line feed code'\r\n'Or\n'Separate with and convert to an array
lines = re.split(r'\r?\n', line)
#Finally take out n pieces from the back,Convert to float type and return
result = [list(map(float, line.split(','))) for line in lines[-n:]]
#If n is not specified, the last line is returned alone.
if not is_list:
return result[-1]
else:
return result
The explanation is in the comments. Now for the main event: time measurement.
```
timeit -n100 tail_b(fn, 3)
87.8 µs ± 3.74 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
The best time so far was the first approach at 18.8 ms ± 175 µs. The new version takes about 0.5% of that time; in other words, it is roughly 200 times faster. That is only natural: it is the difference between reading all 86,400 lines from the beginning and reading just a few lines from the end.
I introduced four patterns, but there seems to be one more: executing the system's `tail` command via the `subprocess` module. That method is environment-dependent, so I left it out this time.
Of the methods introduced, the one I recommend most is the two-line pandas version. Python is a language that lets you lean on other people's code, so it pays to learn how to take it easy.

As for reading from the end of the file, I recommend it when you need speed, or when the file has so many lines or characters that reading it from the beginning takes too long.
Also, there is no particular meaning to the 64 used to determine chunk_size. Setting it to roughly the length of one line of the file is probably fastest, but in some files the line length varies greatly from line to line, so nothing definitive can be said.

If you are dealing with files where some lines are a few characters long and others run to 10,000 characters, you would need to change chunk_size dynamically. For example, if one pass does not find n lines, double chunk_size for the next pass. Deciding the next chunk_size from the number of lines found so far and their average length also seems effective.
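A minimal sketch of that doubling idea (my own illustration, not code from the referenced page; the name `tail_doubling` is made up):

```python
def tail_doubling(fn, n, initial_chunk=64):
    # Read growing chunks from the end until one contains at least n full lines
    with open(fn, 'rb') as f:
        f.seek(0, 2)
        size = f.tell()
        chunk_size = initial_chunk
        while True:
            # Never read past the beginning of the file
            read_size = min(chunk_size, size)
            f.seek(-read_size, 2)
            chunk = f.read(read_size)
            # More than n newlines guarantees n complete lines even if the
            # chunk starts mid-line; reading the whole file also ends the loop
            if chunk.count(b'\n') > n or read_size == size:
                lines = chunk.splitlines()
                return [l.decode() for l in lines[-n:]]
            # Otherwise double the chunk and try again
            chunk_size *= 2
```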