What is the true identity of Python's sort method "sort"? ??

Introduction

I will write an article about Qiita after a long time. This is * nishiwakki . This time, I will post this article as the "12th day" frame of " Python Part 2 Advent Calendar 2020 *"! (I'm sorry just before posting and changing the date ... The color of the figure is totally unpleasant because I was in a hurry)

As a hobby, I use Python for a little play development and competition pros, but I often use ** sort ** at that time. When it comes to sorting, bubble sort and quicksort come to mind, but what sort is used by Python's sort method sort? It was because I suddenly thought.

#Try using the sort method sort
l = [5, 1, 3, 4, 2]
print(l) #Output result: [5, 1, 3, 4, 2]
l.sort()
print(l) #Output result: [1, 2, 3, 4, 5]

I hope that those who are "not interested in sorting" and those who say "a man is silent and bogosort" can read it!

From the conclusion

The sorting algorithm used in Python is called "** Timsort **". (From the official Python document "Sorting HOW TO")

... what? Please rest assured that you didn't know. I didn't even know ... (crying)

Timsort Profile

I investigated every corner from the top to the bottom of Tim.

item order Explanation
Average calculation time O(n \log n) Average time to sort
Best calculation time O(n) Time when the original arrangement is good and the fastest sorting is possible
Worst calculation time O(n \log n) Time when the original order is bad and sorting takes the longest
memory usage O(n) Also with spatial complexity. Amount of memory used for sorting
Stability Yes The order of data with the same value does not change before and after sorting

As for the amount of calculation of sort, the worst calculation time is often picked up. Since it is written in order notation, some people may find it difficult, but when compared with other sorts, it looks like this.

sort Worst calculation time
Bubble sort O(n^2)
Quick sort O(n^2)
Timsort O(n \log n)

And the order notation has such a fast / slow relationship.

[Fast] $ O (1) $ <$ O (\ log n) $ <$ O (n) $ <$ O (n \ log n) $ <$ O (n ^ 2) $ [Slow]

Tim-chan is a fast sorting algorithm, isn't it? Excellent! I'm also happy that it's stable!

The identity of Timsort

I know Tim is good, but who is he? In conclusion, Timsort is a ** hybrid ** of the following two famous sorts.

-Insert Sort -[Advantage] It is fast if it is aligned to some extent before sorting (best calculation time is fast) -Merge sort -[Advantage] Stable sort speed can be expected

The details of the two sorts are omitted in this article. In this article, we'll take a look at Timsort, a simple implementation! (The actual Python sort implementation is complicated)


Timsort has four main steps!

  1. ** Calculation of minRun **
  2. ** Divide the array based on minRun **
  3. ** Insertion sort is performed for each divided array **
  4. ** Merge sort on split sorted array **

[Caution] If the number of elements in the array you want to Timsort $ N $ is $ N <64 $, The array split in step 2 is not performed, only the insertion sort is performed on the input array and the process ends.

Let's look at each step.

STEP1. Calculation of minRun

Divides the input array to be sorted into an array called a subarray. The minimum number of elements in that subarray is called minRun and it is calculated.

1.001.jpeg

The value of minRun has the following three conditions.

-** For the number of elements N and minRun in the input array, $ \ frac {N} {minRun} $ is a power of 2 (2, 4, 8, 16, ...) or a value close to it. Being ** -** The value of minRun is not too large ($ minRun <256 ) ** -** The value of `minRun` is not too small ( minRun> 8 $) **

The one that meets the above conditions and gives the highest performance is $ 32 \ leqq minRun \ leqq 64 $. Use the following function to calculate this number from the number of elements in the input array N.

#Function to calculate minRun
def calcMinRun(n):
    r = 0
    while n >= 64:
        #If the least significant bit is 1, r=Become 1
        r |= n & 1
        #Shift n to the right by one digit(n //Something like 2)
        n >>= 1
    return n + r

It's hard to see at first glance, but the point is that if there is at least one 1 in the lower bits, excluding the upper 6 bits ($ 2 ^ 6 = 64 $) when N is expressed in binary. The value of r will be 1. (R is 0 or 1)

2.002.jpeg

Examples 1 and 2 above are minRun = 38 and minRun = 32, respectively.

STEP2. Divide the array based on minRun

Slice the input array 1 by the number of elements of minRun to make it a sub-array. In example 1 of STEP1, it will be as follows.

名称未設定.001.jpeg

The last subarray may be smaller than minRun, but we'll just handle it (in this article). (I'll add a little explanation in "The actual Timsort is more amazing" below.)

In addition, $ \ frac {N} {minRun} = 31.921 ... $, which is close to $ 2 ^ 5 = 32 $.

3. Perform insertion sort for each divided array

Performs ** insertion sort ** on all divided sub-arrays. In this article, we will use a normal insertion sort program.

Insertion sort function
#Function that performs insertion sort
def insertionSort(arr, l, r):
    for i in range(l+1, r+1):
        j = i
        while j > l and arr[j] < arr[j-1]:
            arr[j], arr[j-1] = arr[j-1], arr[j]
            j -= 1

Now, let's actually see the contents of the steps so far programmatically!

#Functions that perform Timsort
def timSort(arr):
    #Get the length of the list
    n = len(arr)
    #[STEP1] Calculate minRun based on length
    minRun = calcMinRun(n)
    #[STEP2] Divide into sub-arrays(Repeatedly executed for each length of minRun)
    for start in range(0, n, minRun):
        #Prevent out-of-range access to the list during the final subarray
        end = min(start + minRun - 1, n - 1)
        #[STEP3] Insertion sort performed in the sub array
        insertionSort(arr, start, end)
    :
    #[STEP4] Merge sort is performed on the divided sorted array.
    :

4. Perform merge sort on the divided sorted array

The procedure for merging is the same as the general merge sort procedure. It's already split, so ** just merge ** is enough! The image looks like this.

名称未設定.001.jpeg

Merge sort function (merge only)
#Merge sort(Not implemented)Function that implements
def mergeSort(arr, l, m, r):
    left, right = arr[l:m+1], arr[m+1:r+1]
    for i in range(l, r+1):
        if len(left) == 0:
            arr[i] = right.pop(0)
        elif len(right) == 0:
            arr[i] = left.pop(0)
        elif left[0] <= right[0]:
            arr[i] = left.pop(0)
        else:
            arr[i] = right.pop(0)

Let's take a look at the continuation of the previous program!

def timSort(arr):
    #[STEP1] Calculate minRun based on length
    :
    #[STEP2] Divide into sub-arrays(Repeatedly executed for each length of minRun)
    :
        #[STEP3] Insertion sort performed in the sub array
        insertionSort(arr, start, end)

    #Prepare the variable size (counting the total number of elements in the merged subarray)
    size = minRun
    while size < n:
        #Get the leftmost index of the two subarrays to merge
        for left in range(0, n, 2 * size):
            #Get the central index of the two subarrays to merge
            mid = min(n - 1, left + size - 1)
            #Get the rightmost index of the two subarrays to merge
            right = min((left + 2 * size - 1), (n - 1))
            #Perform merge sort
            mergeSort(arr, left, mid, right)
        size = 2 * size

The above is a simple implementation of Timsort in Python.


Tim-chan, even a simple implementation is complicated and difficult ... (crying)

The actual Timsort is even more amazing (added later)

**sorry! I will change the date, so I will upload it here today! I will add it later ... ** ** If you like it, you will be more motivated to write! (Begging for life) (I'm sorry I got sick) **

Recommended Posts

What is the true identity of Python's sort method "sort"? ??
What is the cause of the following error?
[Introduction to Python] What is the method of repeating with the continue statement?
Regarding Python's Sort method
What is Python's __init__.py?
What is a recommend engine? Summary of the types
The copy method of pandas.DataFrame is deep copy by default
What is the Linux kernel?
What is the interface for ...
What is the Callback function?
What is the default TLS version of the python requests module?
Numerical approximation method when the calculation of the derivative is troublesome
Is the probability of precipitation correct?
[Python] What is @? (About the decorator)
What kind of Kernel is this Kernel?
What is the X Window System?
What is the python underscore (_) for?
Science "Is Saito the representative of Saito?"
[Linux] What is the method to solve the package dependency error of yum and rpm in the actual workplace?
What is Newton's method? ?? Approximate solution of equation to be solved by Newton's method
[Linux] What is the host name confirmation method other than the hostname command?
What is the XX file at the root of a popular Python project?
[Example of Python improvement] What is the recommended learning site for Python beginners?
What kind of book is the best-selling "Python Crash Course" in the world?
Find out the name of the method that called it from the method that is python
Count / verify the number of method calls.
What kind of programming language is Python?
What is the ETL processing framework clivoa?
[Unix] What is the zombie process / orphan process?
What is "mahjong" in the Python library? ??
[python] [meta] Is the type of python a type?
[Machine learning] What is the LP norm?
The update of conda is not finished.
The backslash of the Japanese keyboard is "ro"
In Python, change the behavior of the method depending on how it is called
What to do if (base) is displayed at the beginning of the Mac terminal
What is the difference between `pip` and` conda`?
[Python] Sort the list of pathlib.Path in natural sort
About the accuracy of Archimedean circle calculation method
The answer of "1/2" is different between python2 and 3
What is wheezy in the Docker Python image?
Sort the elements of the array by specifying the conditions
The origin of Manjaro Linux is "Mount Kilimanjaro"
FAQ: Why is the comparison of numbers inconsistent?
The value of pyTorch torch.var () is not distributed
This is the only basic review of Python ~ 1 ~
This is the only basic review of Python ~ 2 ~
What is bucket sort? Merideme and code example
It's a Mac. What is the Linux command Linux?
(Linux beginner) What is the magic word aux?
This is the only basic review of Python ~ 3 ~
What is equal repayment of principal and interest and equal repayment of principal and interest?
What is the difference between Unix and Linux?
[Django] I didn't quite understand what the reverse method is, so read the official documentation.
What to do if the progress bar is not displayed in tqdm of python