Sort large text files in Python

[Merge sort]: https://ja.wikipedia.org/wiki/%E3%83%9E%E3%83%BC%E3%82%B8%E3%82%BD%E3%83%BC%E3% 83% 88 [io.IOBase.readlines]:http://docs.python.jp/3/library/io.html#io.IOBase.readlines

Background

Python scripts running in a Windows environment now need to sort text files.

The requirements are as follows.

--Run in Windows environment --The target file is a CSV file of several GB --The sort key has multiple columns, and there is also a numeric column.

The Windows command prompt has a [sort command] windows-sort that supports sorting huge files. However, unlike the [Linux sort command] linux-sort, you cannot specify a delimiter (-t) or sort as a number (-n).

If the size is small, you can use a Python script to load all the data into memory and call the sorted function, but if the size is large, you may run out of memory.

Since there is no help for it, I decided to implement a sort process corresponding to a huge file in Python.

Processing procedure

To sort a huge amount of data, use the concept of [Merge sort] as follows.

  1. Split
  2. Read enough lines from the input file to fit in memory
  3. Sort the read part and write it to a temporary file
  4. Repeat until all input files are read
  5. Merge
  6. Read one line from each temporary file
  7. Write the smallest value to the output file
  8. Repeat until all temporary files have been read

I have created a Processing Procedure Slide so that you can see how it works. マージソート_-_Qiita.png

Implementation method

Split

Use [io.IOBase.readlines] to split and read the file.

Using readlines with no arguments loads all lines into memory, so using readlines when the file size is large is known as an anti-pattern.

However, readlines can limit the number of bytes / characters to be read by the argument, which can be used to divide and read even a huge text file.

>>> from io import StringIO
>>> f = StringIO("11\n22\n3\n4\n5\n666666\n")
>>> f.readlines(5)
['11\n', '22\n']
>>> f.readlines(5)
['3\n', '4\n', '5\n']
>>> f.readlines(5)
['666666\n']
>>> f.readlines(5)
[]

It seems to be the number of bytes for files opened in binary mode and the number of characters for text mode.

merge

Use heapq.merge for the merge. heapq.merge returns the result of merging multiple sorted iterables.

>>> import heapq
>>> list(heapq.merge([1, 3, 4, 7], [2, 5], [6]))
[1, 2, 3, 4, 5, 6, 7]

Source

It is listed in gist. The Python version is 3.6.

The following is an excerpt of the sort part.

import heapq
import os
from contextlib import contextmanager
from operator import itemgetter
from tempfile import TemporaryDirectory, mktemp
from typing import IO, Callable, List


def large_sort(input_file: IO, output_file: IO, key: Callable=None, reverse: bool=False, limit_chars: int=1024*1024*64):

    with TemporaryDirectory() as tmp_dir:

        for lines in _read_parts(input_file, limit_chars):
            lines = sorted(lines, key=key, reverse=reverse)
            _write_part(lines, tmp_dir)

        with _open_tmp_files(tmp_dir) as tmp_files:
            for row in heapq.merge(*tmp_files, key=key, reverse=reverse):
                output_file.write(row)


def _read_parts(input_file, limit_chars):
    lines = input_file.readlines(limit_chars)
    while lines:
        yield lines
        lines = input_file.readlines(limit_chars)


def _write_part(lines, tmp_dir):
    tmp_filename = mktemp(dir=tmp_dir)
    with open(tmp_filename, "w") as tmp_file:
        tmp_file.writelines(lines)
    return tmp_filename


@contextmanager
def _open_tmp_files(tmp_dir):
    filenames = os.listdir(tmp_dir)
    files = [open(os.path.join(tmp_dir, filename), "r") for filename in filenames]
    try:
        yield files
    finally:
        for file in files:
            file.close()

In this requirement, I wanted a function to call from another Python script, so I don't need it, but in the gist source, I made it available from the command line by referring to the sort command of linux.

Other ideas

I made it myself this time, but I will write it because there are other solutions.

--Use a system environment that can use Linux commands such as Cygwin --This time, the execution environment was a non-engineer's PC, so it's hard to say "install Cygwin", so I rejected it. --It seems possible to extract and bundle only sort.exe and necessary dlls. --Use a free sort tool ――I searched lightly, but I couldn't find it that could be used from the command line and met the requirements. ――Maybe you can find it if you search properly --Powershell sort command --It seems to be more sophisticated than the sort command at the command prompt, but it has not been confirmed.

Recommended Posts

Sort large text files in Python
Sort large text files
Clustering text in Python
Bubble sort in Python
Text processing in Python
Custom sort in Python3
UTF8 text processing in python
Naturally sort Path in Python
Sort huge files with python
Speech to speech in python [text to speech]
Sort by date in python
Transpose CSV files in Python Part 1
GOTO in Python with Sublime Text 3
Manipulate files and folders in Python
Handling of JSON files in Python
Download Google Drive files in Python
Extract text from images in Python
Python # sort
Read files in parallel with Python
Export and output files in Python
Law of large numbers in python
Reading and writing text in Python
Extract strings from files in Python
When specifying multiple keys in python sort
New in Python 3.9 (2)-Sort directed acyclic graphs in Python
Find files like find on linux in Python
Output tree structure of files in Python
Type annotations for Python2 in stub files!
Automate jobs by manipulating files in Python
Read and write JSON files in Python
Sample for handling eml files in Python
Implemented Stooge sort in Python3 (Bubble sort & Quicksort)
Try text mining your diary in Python
Download files in any format using Python
Read text in images with python OCR
Quadtree in Python --2
Python in optimization
CURL in python
Geocoding in python
SendKeys in Python
Meta-analysis in Python
Unittest in python
Epoch in Python
Discord in Python
Sudoku in Python
DCI in Python
quicksort in python
nCr in python
N-Gram in Python
Programming in python
Plink in Python
Constant in python
Lifegame in Python.
FizzBuzz in Python
Sqlite in python
StepAIC in Python
N-gram in python
LINE-Bot [0] in Python
Csv in python
Disassemble in Python
Reflection in Python