I will write an article about Qiita after a long time. This is * nishiwakki . This time, I will post this article as the "12th day" frame of " Python Part 2 Advent Calendar 2020 *"! (I'm sorry just before posting and changing the date ... The color of the figure is totally unpleasant because I was in a hurry)
As a hobby, I use Python for a little play development and competition pros, but I often use ** sort ** at that time. When it comes to sorting, bubble sort and quicksort come to mind, but what sort is used by Python's sort method sort
? It was because I suddenly thought.
#Try using the sort method sort
l = [5, 1, 3, 4, 2]
print(l) #Output result: [5, 1, 3, 4, 2]
l.sort()
print(l) #Output result: [1, 2, 3, 4, 5]
I hope that those who are "not interested in sorting" and those who say "a man is silent and bogosort" can read it!
The sorting algorithm used in Python is called "** Timsort **". (From the official Python document "Sorting HOW TO")
... what? Please rest assured that you didn't know. I didn't even know ... (crying)
I investigated every corner from the top to the bottom of Tim.
item | order | Explanation |
---|---|---|
Average calculation time | Average time to sort | |
Best calculation time | Time when the original arrangement is good and the fastest sorting is possible | |
Worst calculation time | Time when the original order is bad and sorting takes the longest | |
memory usage | Also with spatial complexity. Amount of memory used for sorting | |
Stability | Yes | The order of data with the same value does not change before and after sorting |
As for the amount of calculation of sort, the worst calculation time is often picked up. Since it is written in order notation, some people may find it difficult, but when compared with other sorts, it looks like this.
sort | Worst calculation time |
---|---|
Bubble sort | |
Quick sort | |
Timsort |
And the order notation has such a fast / slow relationship.
[Fast] $ O (1) $ <$ O (\ log n) $ <$ O (n) $ <$ O (n \ log n) $ <$ O (n ^ 2) $ [Slow]
Tim-chan is a fast sorting algorithm, isn't it? Excellent! I'm also happy that it's stable!
I know Tim is good, but who is he? In conclusion, Timsort is a ** hybrid ** of the following two famous sorts.
-Insert Sort -[Advantage] It is fast if it is aligned to some extent before sorting (best calculation time is fast) -Merge sort -[Advantage] Stable sort speed can be expected
The details of the two sorts are omitted in this article.
In this article, we'll take a look at Timsort, a simple implementation! (The actual Python sort
implementation is complicated)
Timsort has four main steps!
[Caution] If the number of elements in the array you want to Timsort $ N $ is $ N <64 $, The array split in step 2 is not performed, only the insertion sort is performed on the input array and the process ends.
Let's look at each step.
Divides the input array to be sorted into an array called a subarray.
The minimum number of elements in that subarray is called minRun
and it is calculated.
The value of minRun
has the following three conditions.
-** For the number of elements N
and minRun
in the input array, $ \ frac {N} {minRun} $ is a power of 2 (2, 4, 8, 16, ...) or a value close to it. Being **
-** The value of minRun
is not too large ($ minRun <256
The one that meets the above conditions and gives the highest performance is $ 32 \ leqq minRun \ leqq 64 $.
Use the following function to calculate this number from the number of elements in the input array N
.
#Function to calculate minRun
def calcMinRun(n):
r = 0
while n >= 64:
#If the least significant bit is 1, r=Become 1
r |= n & 1
#Shift n to the right by one digit(n //Something like 2)
n >>= 1
return n + r
It's hard to see at first glance, but the point is that if there is at least one 1
in the lower bits, excluding the upper 6 bits ($ 2 ^ 6 = 64 $) when N
is expressed in binary. The value of r
will be 1
. (R
is 0
or 1
)
Examples 1 and 2 above are minRun = 38
and minRun = 32
, respectively.
Slice the input array 1 by the number of elements of minRun
to make it a sub-array.
In example 1 of STEP1, it will be as follows.
The last subarray may be smaller than minRun
, but we'll just handle it (in this article).
(I'll add a little explanation in "The actual Timsort is more amazing" below.)
In addition, $ \ frac {N} {minRun} = 31.921 ... $, which is close to $ 2 ^ 5 = 32 $.
Performs ** insertion sort ** on all divided sub-arrays. In this article, we will use a normal insertion sort program.
#Function that performs insertion sort
def insertionSort(arr, l, r):
for i in range(l+1, r+1):
j = i
while j > l and arr[j] < arr[j-1]:
arr[j], arr[j-1] = arr[j-1], arr[j]
j -= 1
Now, let's actually see the contents of the steps so far programmatically!
#Functions that perform Timsort
def timSort(arr):
#Get the length of the list
n = len(arr)
#[STEP1] Calculate minRun based on length
minRun = calcMinRun(n)
#[STEP2] Divide into sub-arrays(Repeatedly executed for each length of minRun)
for start in range(0, n, minRun):
#Prevent out-of-range access to the list during the final subarray
end = min(start + minRun - 1, n - 1)
#[STEP3] Insertion sort performed in the sub array
insertionSort(arr, start, end)
:
#[STEP4] Merge sort is performed on the divided sorted array.
:
The procedure for merging is the same as the general merge sort procedure. It's already split, so ** just merge ** is enough! The image looks like this.
#Merge sort(Not implemented)Function that implements
def mergeSort(arr, l, m, r):
left, right = arr[l:m+1], arr[m+1:r+1]
for i in range(l, r+1):
if len(left) == 0:
arr[i] = right.pop(0)
elif len(right) == 0:
arr[i] = left.pop(0)
elif left[0] <= right[0]:
arr[i] = left.pop(0)
else:
arr[i] = right.pop(0)
Let's take a look at the continuation of the previous program!
def timSort(arr):
#[STEP1] Calculate minRun based on length
:
#[STEP2] Divide into sub-arrays(Repeatedly executed for each length of minRun)
:
#[STEP3] Insertion sort performed in the sub array
insertionSort(arr, start, end)
#Prepare the variable size (counting the total number of elements in the merged subarray)
size = minRun
while size < n:
#Get the leftmost index of the two subarrays to merge
for left in range(0, n, 2 * size):
#Get the central index of the two subarrays to merge
mid = min(n - 1, left + size - 1)
#Get the rightmost index of the two subarrays to merge
right = min((left + 2 * size - 1), (n - 1))
#Perform merge sort
mergeSort(arr, left, mid, right)
size = 2 * size
The above is a simple implementation of Timsort in Python.
Tim-chan, even a simple implementation is complicated and difficult ... (crying)
**sorry! I will change the date, so I will upload it here today! I will add it later ... ** ** If you like it, you will be more motivated to write! (Begging for life) (I'm sorry I got sick) **
Recommended Posts