On Linux there is a command called `tail` that returns the last n lines of a file. It's quite convenient, so I want to do the same thing in Python: create a function that retrieves the last n lines of a file as `tail(file_name, n)`, using several approaches.
For the last approach, I referred to a page on it-swarm.dev, "Efficiently find the last line of a text file" (linked below).
The file to be read could be any text file, but this time I will use a CSV file named `test.csv`. Its content is a record of Bitcoin prices, one line per second, for 86,400 lines (one day).
test.csv

```
date,price,size
1588258800,933239.0,3.91528007
1588258801,933103.0,3.91169431
1588258802,932838.0,2.91
1588258803,933217.0,0.5089811
(omitted)
1588345195,955028.0,0.0
1588345196,954959.0,0.05553
1588345197,954984.0,1.85356
1588345198,955389.0,10.91445135
1588345199,955224.0,3.61106
```
Although it has nothing to do with the main subject, a quick note on each column: the units of date, price, and size are Unix time, JPY, and BTC respectively. The first line means that at time 1588258800, that is, at 0:00:00 on May 1st, 3.91528007 BTC was traded at 933,239.0 yen.
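Incidentally, a Unix time like this can be checked with the standard `datetime` module. A minimal sketch (the `+9:00` offset is my assumption that the timestamps are in JST):

```python
from datetime import datetime, timezone, timedelta

# Assume the timestamps are in JST (UTC+9)
jst = timezone(timedelta(hours=9))
dt = datetime.fromtimestamp(1588258800, jst)
print(dt.strftime('%Y-%m-%d %H:%M:%S'))  # 2020-05-01 00:00:00
```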
First, use the built-in function `open()` to get a file object, read all the lines from the beginning, and output only the last n lines. If n is 0 or a negative integer this gives strange results, so strictly speaking n should be restricted to natural numbers, but I'll keep the code simple for readability.
```python
def tail(fn, n):
    # Open the file and read all lines into a list
    with open(fn, 'r') as f:
        # Read one line; the first line is the header, so discard it
        f.readline()
        # Read all remaining lines
        lines = f.readlines()
    # Return only the last n lines
    return lines[-n:]

# Result
file_name = 'test.csv'
tail(file_name, 3)
# ['1588345197,954984.0,1.85356\n',
#  '1588345198,955389.0,10.91445135\n',
#  '1588345199,955224.0,3.61106\n']
```
For a plain text file you could leave it at that, but let's make it a little more convenient for CSV files.
```python
def tail(fn, n):
    # Open the file and read all lines into a list
    with open(fn, 'r') as f:
        # Discard the header line
        f.readline()
        lines = f.readlines()
    # Return each line as a list of values, converting str -> float along the way
    return [list(map(float, line.strip().split(','))) for line in lines[-n:]]

# Result
tail(file_name, 3)
# [[1588345197.0, 954984.0, 1.85356],
#  [1588345198.0, 955389.0, 10.91445135],
#  [1588345199.0, 955224.0, 3.61106]]
```
The only line that has changed is the `return` line, but it chains several functions and is hard to read at a glance, so let's break it down. The following processing is applied to each line.

`strip()`: `'1588345197,954984.0,1.85356\n'` -> `'1588345197,954984.0,1.85356'`

`split(',')`: `'1588345197,954984.0,1.85356'` -> `['1588345197', '954984.0', '1.85356']`

`map(float, ...)`: `['1588345197', '954984.0', '1.85356']` -> `[1588345197.0, 954984.0, 1.85356]`
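Putting the three steps together, the chain can be checked on a single sample line:

```python
line = '1588345197,954984.0,1.85356\n'
stripped = line.strip()           # remove the trailing newline
parts = stripped.split(',')       # split into string fields
values = list(map(float, parts))  # convert each field to float
print(values)  # [1588345197.0, 954984.0, 1.85356]
```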
Since the csv module automatically converts each line to a list, the processing is a little slower, but the code is more concise.
```python
import csv

def tail_csv(fn, n):
    with open(fn) as f:
        # Wrap the file object in a csv reader
        reader = csv.reader(f)
        # Discard the header
        next(reader)
        # Read all rows
        rows = [row for row in reader]
    # Convert only the last n rows to float and return them
    return [list(map(float, row)) for row in rows[-n:]]
```
pandas has a `tail` method, so this version is surprisingly short.
```python
import pandas as pd

def tail_pd(fn, n):
    df = pd.read_csv(fn)
    return df.tail(n).values.tolist()
```
Since pandas stores its data as numpy arrays, `tolist()` converts the result to a plain list at the end. That step is unnecessary if a numpy array is fine for your purposes.

IPython has a convenient magic command called `timeit`, so let's compare the three with the loop count set to 100.
```
timeit -n100 tail('test.csv', 3)
18.8 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit -n100 tail_csv('test.csv', 3)
67 ms ± 822 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit -n100 tail_pd('test.csv', 3)
30.4 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
It turns out that reading the file directly, without any module, is fastest. pandas seems to offer the best cost-performance: the code is the simplest and the speed is reasonable. The csv module converts every single line from a string to a list, including the lines we never use, so its result is by far the worst.
All of the approaches so far read every line of the file. But we only want the last few lines, so if there were a way to read the file from the end, reading should finish in an instant.

I referred to the it-swarm.dev page "Efficiently find the last line of a text file" (www.it-swarm.dev/ja/python/…/940298444/). The idea is to read about 100 bytes at a time from the end; once a newline is found, the string after it is the last line. That page only finds the last line, but to implement the `tail` command we need the n-th line from the end, so only that part is adjusted.
First, as background, let's look at how to move a file pointer. There are three functions to use: `f.tell()`, `f.read(size)`, and `f.seek(offset, whence)`.

`f.tell()` returns the current position of the pointer.

`f.read(size)` reads and returns `size` bytes from the current position. The pointer advances by the amount read; it can only move forward.

`f.seek(offset, whence)` moves the pointer. The argument `whence` specifies the reference position and takes one of the values 0, 1, 2: 0 is the beginning of the file, 1 is the current pointer position, and 2 is the end of the file. `offset` is an integer; unlike `read`, it may be negative, so for example `f.seek(-15, 1)` moves the pointer 15 bytes back from the current position.
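As a quick check, here is a minimal demo of these three calls on a throwaway file (the file itself is just for illustration):

```python
import tempfile

# Create a small throwaway file to experiment on
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'hello\nworld\n')
    path = tmp.name

# seek()/tell() behave most predictably in binary mode
with open(path, 'rb') as f:
    f.seek(0, 2)       # move to the end of the file (whence=2)
    size = f.tell()    # current position = file size = 12
    f.seek(-6, 2)      # 6 bytes back from the end
    chunk = f.read(6)  # the pointer advances as it reads
print(size, chunk)  # 12 b'world\n'
```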
We will implement it based on these.
python
#Use split that can use regular expressions
import re
def tail_b(fn, n=None):
#If n is not given, only the last line is returned alone.
if n is None:
n = 1
is_list = False
#n is a natural number
elif type(n) != int or n < 1:
raise ValueError('n has to be a positive integer')
#When n is given, n rows are returned together in a list.
else:
is_list = True
# 128 *Read n bytes at a time
chunk_size = 64 * n
# seek()Behaves unexpectedly except in binary mode'rb'To specify
with open(fn, 'rb') as f:
#First line to find the leftmost position excluding the header(Header line)I Read
f.readline()
#The very first line feed code is at the left end(End when reading from the end of the file)To
# -1 is'\n'1 byte
left_end = f.tell() - 1
#End of file(2)1 byte back from. read(1)To read in
f.seek(-1, 2)
#Because there are often blank lines and spaces at the end of the file
#Position of the last character in the file excluding them(Right end)Find
while True:
if f.read(1).strip() != b'':
#Right end
right_end = f.tell()
break
#Take one step, so take two steps down
f.seek(-2, 1)
#Number of bytes remaining unread to the far left
unread = right_end - left_end
#Number of lines read.If this becomes n or more, it means that n lines have been read.
num_lines = 0
#Variable for connecting the read byte strings
line = b''
while True:
#The number of unread bytes is chunk_When it becomes smaller than size,Chunk fraction_size
if unread < chunk_size:
chunk_size = f.tell() - left_end
#Chunk from your current location_Move to the top of the file by size
f.seek(-chunk_size, 1)
#Read only the amount you moved
chunk = f.read(chunk_size)
#Connect
line = chunk + line
#Since I proceeded again with read, chunk again at the beginning_size move
f.seek(-chunk_size, 1)
#Update the number of unread bytes
unread -= chunk_size
#If a line feed code is included
if b'\n' in chunk:
#Num for the number of line feed codes_Count up lines
num_lines += chunk.count(b'\n')
#Read more than n lines,Or when the number of unread bytes reaches 0, a signal to end
if num_lines >= n or not unread:
#Last found line feed code
leftmost_blank = re.search(rb'\r?\n', line)
#The part before the line feed code found last is unnecessary
line = line[leftmost_blank.end():]
#Convert byte string to string
line = line.decode()
#Line feed code'\r\n'Or\n'Separate with and convert to an array
lines = re.split(r'\r?\n', line)
#Finally take out n pieces from the back,Convert to float type and return
result = [list(map(float, line.split(','))) for line in lines[-n:]]
#If n is not specified, the last line is returned alone.
if not is_list:
return result[-1]
else:
return result
The explanation is in the comments. Now for the main event: time measurement.
```
timeit -n100 tail_b(fn, 3)
87.8 µs ± 3.74 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
The best time so far was the first approach at 18.8 ms ± 175 µs. The new version takes about 0.5% of that time; in other words, it is roughly 200 times faster. That is only natural: it is the difference between reading all 86,400 lines from the beginning and reading just a few lines from the end.
I introduced four patterns, but there seems to be one more: executing the system's `tail` command via the `subprocess` module. That method is environment-dependent, so I left it out this time.
Of the methods introduced, the one I recommend most is the two-line pandas version. Python is a language that lets you lean on other people's code, so it pays to learn how to take it easy.

As for reading from the end of the file, I recommend it when you need speed, or when the file has so many lines or characters that reading it from the beginning takes too long.
Also, there is no particular meaning to the 64 used to determine chunk_size. Setting it to roughly the length of one line of the file is probably fastest, but in some files the line length varies greatly from line to line, so nothing definitive can be said.

If you are dealing with files where some lines are a few characters long and others run to 10,000 characters, you would need to change chunk_size dynamically. For example, if one pass does not find n lines, double chunk_size for the next pass. Deciding the next chunk_size from the number of lines found so far and their average length also seems effective.
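A minimal sketch of that doubling idea (my own illustration, not code from the referenced page; the name `tail_doubling` is made up):

```python
def tail_doubling(fn, n, initial_chunk=64):
    # Read growing chunks from the end until one contains at least n full lines
    with open(fn, 'rb') as f:
        f.seek(0, 2)
        size = f.tell()
        chunk_size = initial_chunk
        while True:
            # Never read past the beginning of the file
            read_size = min(chunk_size, size)
            f.seek(-read_size, 2)
            chunk = f.read(read_size)
            # More than n newlines guarantees n complete lines even if the
            # chunk starts mid-line; reading the whole file also ends the loop
            if chunk.count(b'\n') > n or read_size == size:
                lines = chunk.splitlines()
                return [l.decode() for l in lines[-n:]]
            # Otherwise double the chunk and try again
            chunk_size *= 2
```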