Sort large text files

The usual approach

The method below is the usual one, but it loads the whole file into memory, so memory use grows with the file size.

with open('/path/to/not-sorted-file', 'r') as fr:
    new_lines = sorted(fr.readlines())  # loads the entire file into memory

# Open in text mode ('w'): the lines are str, so 'wb' would fail on Python 3.
with open('/path/to/sorted-file', 'w') as fw:
    fw.writelines(new_lines)

Handling large files

I wrote a function for this. It works roughly as follows.

  1. Read the file line by line.
  2. For each line, save the following two values and collect them in one list:
     - the substring used as the sort key
     - the file offset where the line starts
  3. Sort the list from step 2.
  4. Reopen the file as fr.
  5. Open the output file as fw.
  6. Loop over the list from step 3 and, for each entry, move fr to the saved file offset.
  7. line = fr.readline()
  8. fw.write(line)

This method still uses memory in proportion to "key substring size x number of lines" (;_;)

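import os
import shutil
import tempfile
import uuid

def sort_large_file(filename, key=lambda l: l[:5]):
    '''
    Sort a large file without loading all of its lines into memory.

    :param str filename: absolute path of the file to sort in place.
    :param function key: function that builds the sort key from a line.
    '''
    # Move the unsorted file aside under a unique temporary name.
    # uuid4().hex replaces get_hex(), which only exists on Python 2.
    # shutil.move also works when the temp directory is on another filesystem,
    # where os.rename would raise an error.
    tmpname = os.path.join(tempfile.gettempdir(), 'sortlargefile_%s' % uuid.uuid4().hex)
    shutil.move(filename, tmpname)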

    # make a list of offsets.
    offset_list = []
    with open(tmpname, 'r') as fr:
        while True:
            offset = fr.tell()
            line = fr.readline()
            if not line:
                break

            keyword = key(line)
            offset_list.append((keyword, offset, ))

    # sort offsets.
    offset_list.sort(key=lambda e: e[0])

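    # Write the lines to the new file in sorted order.
    # Both files use text mode: readline() returns str, so 'wb' would fail on Python 3.
    with open(filename, 'w') as fw, open(tmpname, 'r') as fr: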
        for keyword, offset in offset_list:
            fr.seek(offset)
            line = fr.readline()
            fw.write(line)

    # remove tmp.
    os.remove(tmpname)
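
To get a rough feel for the "key substring size x number of lines" cost, the sketch below compares the memory needed for full lines with the memory needed for (key, offset) entries. The helper name estimate_sizes and the sample data are hypothetical, and sys.getsizeof reports CPython object sizes only (it ignores the list's own overhead), so treat the numbers as an approximation.

```python
import sys

def estimate_sizes(lines, key=lambda l: l[:5]):
    """Return rough byte counts: (full lines, (key, offset) index entries)."""
    full = sum(sys.getsizeof(line) for line in lines)
    index = 0
    offset = 0
    for line in lines:
        entry = (key(line), offset)
        # Count the tuple itself plus the key string and the offset integer.
        index += sys.getsizeof(entry) + sys.getsizeof(entry[0]) + sys.getsizeof(entry[1])
        offset += len(line)
    return full, index

# Hypothetical data: 1,000 lines of roughly 200 characters each.
sample = ['%05d,%s\n' % (i, 'x' * 200) for i in range(1000)]
full_bytes, index_bytes = estimate_sizes(sample)
print(full_bytes, index_bytes)
```

The saving depends on lines being much longer than their keys. For very short lines, the per-entry overhead of the tuple can make the index as large as the lines themselves.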

Actually try

Call the function as follows.

> sort_large_file('/path/to/your/file', lambda l: l[:l.find(',')])

↓ This is the original CSV.

2016-10-01,apple,red
2016-09-29,orange,orange
2015-12-21,banana,yellow
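
The key argument decides which part of each line is compared. As a small illustration, the sketch below applies two key functions to the sample rows with the built-in sorted(). The names first_column and second_column are hypothetical helpers, not part of sort_large_file().

```python
def first_column(line):
    # Text before the first comma; matches l[:l.find(',')] when a comma is present.
    return line.split(',', 1)[0]

def second_column(line):
    # Text between the first and second comma (assumes at least two columns).
    return line.split(',')[1]

rows = [
    '2016-10-01,apple,red\n',
    '2016-09-29,orange,orange\n',
    '2015-12-21,banana,yellow\n',
]

by_date = sorted(rows, key=first_column)
by_name = sorted(rows, key=second_column)
print(''.join(by_date))
print(''.join(by_name))
```

Either function could be passed as the second argument of sort_large_file(). Keeping the key short matters, because every key is held in memory until the sort finishes.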

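Note that sort_large_file() requires the last line to end with a line break as well. Otherwise, if that line moves up during sorting, it is written without a line break and runs into the line that follows it.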
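
One way to satisfy the line-break requirement is to check the last byte of the file before sorting. The following is a minimal sketch, not part of the original function: ensure_trailing_newline is a hypothetical helper, and it assumes "\n" line endings in an ASCII-compatible encoding such as UTF-8.

```python
import os
import tempfile

def ensure_trailing_newline(path):
    """Append a line break if the last byte is not one. Return True if changed."""
    with open(path, 'rb+') as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return False  # empty file: nothing to fix
        f.seek(-1, os.SEEK_END)
        if f.read(1) == b'\n':
            return False  # already ends with a line break
        f.seek(0, os.SEEK_END)
        f.write(b'\n')
        return True

# Demo on a temporary file whose last line lacks a line break.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'2016-10-01,apple,red\n2015-12-21,banana,yellow')
changed_first = ensure_trailing_newline(demo_path)
changed_second = ensure_trailing_newline(demo_path)
with open(demo_path, 'rb') as f:
    demo_bytes = f.read()
os.remove(demo_path)
```

Only the final byte is read, so the check costs the same no matter how large the file is.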

↓ The sorted file looks like this.

2015-12-21,banana,yellow
2016-09-29,orange,orange
2016-10-01,apple,red
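
For a file too large to inspect by eye, the result can be checked by streaming through it and comparing each key with the previous one. This is a rough sketch under the same assumptions as above: is_sorted_file and first_field are hypothetical names, and the temporary file only stands in for a real sorted file.

```python
import os
import tempfile

def is_sorted_file(path, key=lambda l: l[:5]):
    """Return True if the keys of consecutive lines never decrease."""
    previous = None
    with open(path, 'r') as fr:
        for line in fr:  # only one line is held in memory at a time
            current = key(line)
            if previous is not None and current < previous:
                return False
            previous = current
    return True

def first_field(line):
    # Same key as in the call above: the text before the first comma.
    return line[:line.find(',')]

fd, check_path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'w') as fw:
    fw.write('2015-12-21,banana,yellow\n'
             '2016-09-29,orange,orange\n'
             '2016-10-01,apple,red\n')
sorted_ok = is_sorted_file(check_path, first_field)

with open(check_path, 'w') as fw:
    fw.write('2016-10-01,apple,red\n2015-12-21,banana,yellow\n')
unsorted_ok = is_sorted_file(check_path, first_field)
os.remove(check_path)
```

The check must use the same key function that was passed to sort_large_file(), since the order is only defined with respect to that key.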
