I studied TimSort ~ Sophisticated sorting method that is implemented as standard in Java and Python ~ (Part 1)

How did you start studying Tim Sort?

Initially, I was thinking of sorting a pair of two elements ($ m (m-1) / 2 $ pairs) in a set of $ m $ elements, and it was already $ O (m ^ 2). Since there were $ pairs, I thought it would take time if I didn't devise a sort, so I implemented heapsort and sorted it myself. (Heapsort is worst $ O (n \ log n) \ sim O (m ^ 2 \ log m ^ 2) $. On the other hand, naive bubble sort and insertion sort are $ O (n ^ 2) \ sim O ( It costs m ^ 4) $. Here, the number of pairs is described as $ n \ sim m ^ 2 $)

However, looking back, Java and Python are evolving day by day, so I thought that the sorting algorithm used in the standard implementation should be evolving, so I investigated it. I didn't see much need to keep it as a simple method like bubble sort.

As expected, it was only being devoted every day, and it implemented a very sophisticated sorting algorithm. My name is Tim Sort.

Program language that implements TimSort

TimSort is said to have been implemented for the first time in 2002 by Tim Peter in Python. Tim Peter is also known as the author of "The Zen of Python," a well-known collection of Python sayings.

(Reference: Wikipedia https://en.wikipedia.org/wiki/Timsort)

In addition, TimSort is also used for Android Platform, GNU Octave, V8, etc.

Overview of Tim Sort

Gaurav Sen's explanation video of Tim Sort was very easy to understand. Reference: https://www.youtube.com/watch?v=emeME__917E Please note that Indian English has a strong habit. (On the contrary, Japanese English seems to be quite addictive to Indians and cannot be heard.) I would like to summarize here what I have studied mainly with reference to this video.

Tim Sort's worst calculation time

is. It's not much different from heapsort or merge sort ($ O (n \ log n)) $, but it's a bit better than these.

In a word, the characteristics of Tim Sort

To put it roughly so that you can easily get an image,

I think it can be said.

As I said earlier, there is no significant improvement in worst-case calculation time, like the comparison between bubble sort $ (O (n ^ 2)) $ and heapsort $ (O (n \ log n)) $. However, it can be said that it is a combination of some improved sorting algorithms and devised to be the best in all cases. (I don't think it's always the best)

In some cases, choosing the best sorting method should be welcomed in ** when a variety of use cases such as programming languages are expected **. If you want to cover a variety of use cases, the amount of data may be large or small. I think the point is to provide a faster sorting method even when it is small.

Tim Sort procedure overview

To put it very roughly

  1. Group the columns to be sorted into a set of small sorted columns (runs). ● Use insertion sort, which is more efficient than merge sort for small sizes, when grouping into columns

  2. Merge run using merge sort, but in that case ● Adopted the merge order of runs that can suppress the number of merges (stack using inequality of relation close to Fibonacci number) ● Efficient merging procedure (rangeless binary search, gallop mode, merging adjacent memory areas, etc.) We will proceed efficiently using.

I think it will be. Below, I will look at the details of the procedure.

1. Divide the column to be sorted into a set of small columns that have already been sorted, "run".

1-1. Sorted small column "run" (number of elements 32 to 64)

First, when we receive a column of elements to sort, we aim to sort it into a set of small sorted columns "run".

Here, why is the size 32 to 64? ** This is because the insertion sort is the upper limit of the size that makes it the fastest sorting algorithm. ** **

1-2. Comparison of insertion sort and other algorithms

The table below summarizes what can be considered as a basic sorting method.

Worst calculation time O(n^2) O(n\log n)
Method Insertion sort
Bubble sort
Selection sort
Merge sort
Quick sort
Heapsort

At first glance, it seems that the $ O (n \ log n) $ method on the right should be used in a unified way, but why use the insertion sort on the left? (I'm sorry for those who understand it.)

It should be kept in mind that this calculation time estimate is valid when ** $ n $ is large enough **. Where the data is large, the number of times the typical processing of the algorithm is repeated is effective, but when the number of data is small, how complicated the algorithm is processing is more than the factor that depends on the number of data. It works. It is due to the complexity of the algorithm-specific processing, which does not depend on the number of data, and is a term of $ O (1) $. In general, the $ O (n ^ 2) $ algorithm was developed earlier, and the one-time processing procedure itself is simple, so those $ O (1) $ factors are smaller.

What is the meaning of this? Make the number of this estimate finer. Coefficient factor $ c_ {i, b, s, m, q, h}, \ tilde {c} _ {i, b, s, m, q, h of $ O (1) $ Let's think about it by including} $.

-$ O (n ^ 2) $ method --Insert sort: $ c_i n ^ 2 + \ tilde {c} _i n $ --Bubble sort: $ c_b n ^ 2 + \ tilde {c} _b n $ --Selection sort: $ c_s n ^ 2 + \ tilde {c} _s n $

-$ O (n \ log n) $ method --Merge sort: $ c_m n \ log n + \ tilde {c} _m n $ --Quicksort: $ c_q n \ log n + \ tilde {c} _q n $ --Heapsort: $ c_h n \ log n + \ tilde {c} _h n $

Below this is a factor of $ n $, which is a common reading factor, for simplification when competing for the time required for the $ O (n \ log n) $ algorithm and the $ O (n ^ 2) $ algorithm. I would like to go out and compare. In this case, where $ n $ is small, the behavior of the worst time $ f (n) $ that each takes is roughly as shown in the graph below.

グラフのイメージ.png

Where the size of the data is small, the $ O (1) $ factor introduced earlier, which has nothing to do with the size of the data, is effective in time. Basically, the coefficient factor $ \ tilde {c}, c_ {i, b, s} $ of the $ O (n ^ 2) $ algorithm is better than the $ c_ {of the $ O (n \ log n) $ algorithm. It is smaller than m, q, h} $, $ \ tilde {c} _ {m, q, h} $ because it is simpler to process. Eventually, as the data becomes larger, the derivative of $ \ log n $ becomes smaller and the order is reversed, but where the data size is smaller, the gradient of the $ \ log n $ function becomes $ 1 / n. My understanding is that the $ O (n ^ 2) $ algorithm has a reasonable interval before it gets bigger because $ isn't small enough.

Thus, in small areas of data, the $ O (1) $ coefficient, which is determined by how complex the algorithm is, is effective. Then, when the data is small, it is necessary to adopt an algorithm that performs simple processing such that such a coefficient is the smallest. The answer is ** insertion sort **. (As a non-coefficient view, insertion sort is faster than bubble sort or selection sort because it can reduce the number of comparisons (sabo) in some cases. Currently, it is a full sort of a given column, so the part Sorting is not considered.)

Therefore, in order to create the first sorted subsequence run, we will create it by insertion sort while the number of data that is the fastest insertion sort is guaranteed. It is said that the number of elements is about 32 to 64. ** Therefore, the number of elements in run is set to 32 ~ 64.

1-3. Improvement of insertion sort

TimSort uses insertion sort when the data is small, as I explained earlier, but I want to make it even faster even with fast insertion sort, so I use ** half-insertion sort **. Binary insertion sort is an improved algorithm for insertion sort that performs a binary search to search for the insertion position.

1-4. Locality of the element that generates the run

Here, when forming one run, the run is drawn up by inserting and sorting only the elements that exist nearby in the original given column. Therefore, insertion sort does not search for all elements in a column of size $ n $ to search for elements to insert into run. Therefore, if the number of elements in the column to be sorted is large, $ n $, the order of time required for this operation will end at about $ O (n) $, which is the number of runs. This order is a small term compared to the time required for merge sort that will appear later, and it does not work for worst time estimation.

2. Merge using merge sort.

2-1. How to efficiently combine sort subsequences?

The debate is how to reduce the number of joins to form a column of the original size. It helps, surprisingly, to take advantage of the same regularity as the growth of living things in nature. It is a sequence called the ** Fibonacci sequence **. This sequence has a pattern in which organisms, especially plants, grow more efficiently, and the fact that they grow steadily means that if this is adopted, large aggregates can be formed efficiently and in a short time.

The Fibonacci sequence is described by the following recurrence formula:

\begin{equation}
a_{m+2} = a_{m+1} + a_{m}
\end{equation}

In this recurrence formula, when the first number term is $ a_0 = 0, , a_1 = 1 $, the behavior when $ m \ gg 1 $ is

\begin{equation}
a_{m} \sim \frac{\phi^m}{\sqrt{5}}, \hspace{2cm} \phi = \frac{\sqrt{5} + 1}{2} \sim 1.6 > 1
\end{equation}

It will be. From this, we can see that it can grow in the recurrence step of $ m \ sim \ log n $ until it grows to the target size $ n $.

In other words, the point of TimSort is that if you perform run join processing with a structure close to the Fibonacci sequence, you will be able to make a final list with the number of merges of $ \ log n $.

Of course, if you make a strict Fibonacci sequence, it will take time and effort to make it a structure, and it is wasteful. I don't want an exact Fibonacci sequence. Therefore, the point is to carry out a merge process that makes the structure close to the Fibonacci sequence reasonably close to nature **. Even if you are far from Fibonacci at first, you can get closer to that sequence as you carry out the process. The order estimate is a rough estimate of $ n \ gg 1 $ (it should be Fibonacci when $ n $ grows).

That is the "use of the stack using inequalities close to the Fibonacci sequence" explained next.

2-2. Use of stack with inequalities close to Fibonacci sequence

Merge the run while putting it on the stack. (I will add the stack diagram later if I have some spare capacity, but if you can refer to it on wikipedia for a while and get an image)

Reference (stack): https://en.wikipedia.org/wiki/%E3%82%B9%E3%82%BF%E3%83%83%E3%82%AF

The stack basically has a bucket-like structure with the same entrance and exit, in which new data is put in from the top and stacked on top. At the bottom is the structure where the old members are placed. When pulling data off the stack, it is pulled from the top data.

Here, I put it on the stack and merge the columns in the stack, but when I finish putting all the data and enter the procedure of the final stage of merging, the run that is in the stack We aim to keep the size relations so that they satisfy the following mirror mochi-like relations. Here, the subscript $ 0 $ indicates the top element at the top of the stack, and the higher the number, the lower the data. Here, let the symbol representing run be $ r_ {i} (i = 0, 1, \ cdots) $.

\begin{align}
r_{n+2} &> r_{n+1} + r_{n} \\
r_{n+1} &> r_{n}
\end{align}

If you look at the very first equation, this relation is not a Fibonacci sequence, but it looks roughly like a similar form and a relaxed version of the Fibonacci sequence. If these two inequalities are satisfied, the data array will be in the stack close to the Fibonacci column, and the final merging procedure will be $ \ log n $.

The remaining question is what to do to get to this state of inequality. It's equivalent to doing the following in the supernatant of the stack every time you put data on the stack: (Java-like, fake algorithm is written here) Name the array that has run as an element in the stack as runs. Imagine that the index of the array is added in ascending order from top.

if(runs.length >= 2){
  if(runs[2].length > runs[1].length + runs[0].length){
    if(runs[1].length > runs[0].length){
       break;
    }
    else{
      merge(runs[0], runs[1]);
    }
 }
 else{
   merge(runs[1], min(runs[0], runs[2])); //min is runs[0]And runs[2]This is the only fake notation that represents the shorter of
 }

}

In the above algorithm, the merge operation is performed so that the supernatant part approaches the inequality when new data is put on the stack. Continuing to manipulate the supernatant data like this will eventually lead to a Fibonacci-like runs column.

This has a structure called a stack that allows data to flow from above. The image here is to pour the unmelted okonomiyaki dough on the iron plate little by little, and use a spatula or something so that the lower dough takes up a larger area than the upper dough. I think it's similar to doing. From the beginning of pouring data into the stack, if you continue pouring while pressing with a "spatula" so that the bottom is longer so that the supernatant alone is stable, the old data sequence without sinking will be Since it is held down by the merge operation for a long time, it becomes a better and longer line, and a stable Fibonacci sequence of run lines can be naturally formed. In other words, ** the old data has the property of a stack that stagnates at the bottom, and the run with a large number of elements that has been exposed to the merge operation for a long time naturally sinks under the stack. ** is the key.

By constructing a data set of run in Fibonacci state so as not to overdo it in this way, the number of merge operations is limited to $ \ log n $.

2-3. Merge sort method and worst time per merge sort

After that, the worst calculation time for one merge sort becomes important. In fact, thanks to this already sorted run, the execution time of a merge sort per ** can be kept to $ O (n) $. ** **

In merge sort, the memory area that stores the array after merging is first held (the continuous memory area originally used by the two runs), and the memory area there is contained in the combined elements of the two runs. I will put them in order from the smallest one. First, name the two runs as side A being compared and side B being compared. The comparison side A first picks up one by one as "elements of A to be compared" in order from the smallest. (** run is already sorted, so the time it takes for this pickup is O (1) .) That one element is compared in order from the smallest element in B. ( Again, picking up from the smallest element does not take much time because it is sorted. ) Here, the memory that stores the array with the smaller one in comparison I will put it in the area. If the element of B is small, the "element of A to be compared" is left as it is, and the smaller element of B is inserted. Then, the next smaller element in B is compared with the element in A, and the smaller element of A or B is stored in the memory area that also stores the merged array. If the "element of A to be compared" is small, it is stored in the memory area and the next smaller element of A is brought. Then, we will start the comparison from the smallest element of B that is not in memory. ( The details around here will be very long, so I'll try to explain with figures and concrete examples in the sequel again. **) The point here is ** because it has already been sorted. The time taken is about the number of comparisons, which is equal to the number of elements in the merge-sorted area (because they are included at the same time as the comparison). In other words, the time it takes at worst is O (n) **.

From these calculations, the worst calculation time of TimSort is about $ O (n \ log n) $ by multiplying the calculation time $ O (n) $ per time by the number of joins $ O (\ log n) $. I think it will be.

2-3-1. Shouldn't we use binary search?

The above explanation is explained by linear search, and I think that there is a rush that it may not be very efficient. However, ** this is a case-by-case basis, and in some cases a binary search may take longer. ** For example, if two runs are similar in size and one element and the other element are close to each other. In such a case, when a binary search is performed, a comparison with a partner whose size is completely different so that the magnitude relationship does not have to be considered at all is also performed, and $ \ log n $ for one element. We will carry out the number of times. If you do that for all the elements, it will cost you $ O (n \ log n) $ in the worst case. Therefore, it is better not to carry out a ** easy binary search. ** If you do that, linear search is more efficient when merging already sorted runs.

However, on the contrary, if the size of the two runs is biased or the size distribution of the elements is biased, there are clearly useless searches even if a linear search is performed, and they are Naturally, I think it's better to skip. For example, if there are many elements in run B that are smaller than the minimum value a of run A, it is wasteful to compare the magnitudes of the minimum value a of run A and the elements of run B in order from the minimum value. Therefore, a method to implement some kind of skip is required.

** Unranged binary search ** does that efficiently. This makes it possible to bridge the case where the linear search is more efficient and the binary search is more efficient because the elements are less biased.

2-3-2. Binary search without range

In English, it is called exponential search or galloping search, but it is as follows.

Here, the two runs to be merged are A and B, and the smallest unmerged element of A is $ A [0] $. We will compare this with the element $ B [i] , (i = 0, 1,2,3,4, \ cdots) $ of run B.

  1. Increase $ B [0], B [1], B [3], B [7], B [15], \ cdots B [2 ^ {n} --1] $ and interval by 2 exponents Find out where $ A [0] $ falls into the range $ (B [2 ^ {n}] \ sim B [2 ^ {n + 1} -1]) $.

  2. Find where $ A [0] $ goes in binary search with $ (B [2 ^ {n}] \ sim B [2 ^ {n + 1} -1]) $ in that range To do

It becomes the procedure.

Here are some of the great things about this approach: In the combination of two runs with less bias that basically makes the linear search more efficient, first, in the above search, it is roughly between $ B [0] \ sim B [1] $ and $ B [1]. ] \ sim B [3] $ tends to be inserted between narrow ranges of $ A [0] $. In such a case, $ A [0] $ will get caught in the search range early and prevent you from searching far away in vain. (The biggest drawback of binary search) Moreover, since the range of $ B [i] $ that is the target after being caught is narrow, there is no big difference between linear search and binary search from there. On the other hand, if a binary search is better, it is a fairly large interval $ (B [2 ^ {n}] \ sim B [2 ^ {n + 1} -1]) (n \ gg 1) $ Since $ A [0] $ should be entered in, the binary search is naturally more efficient in such a case. In this way, it can be said that the rangeless binary search considers both cases in a well-balanced manner.

Basically, Tim Sort has the atmosphere of performing a natural switch between linear search and binary search through "rangeless binary search". It seems that the index of switching there is called ** Gallop Mode Threshold **. The threshold seems to be set at ** 7 **. This is based on the number of consecutive elements from the beginning (end) of one run to be merged that can be copied to the merged array in that order. If there are more consecutive copies than this value ** 7 **, it seems that the binary search is actively performed by saying ** Gallop mode **. The standard for this number ** 7 ** seems to come from the results so far that the linear search is faster if it is 7 or less.

In the background of setting such a mode, if there is a tendency of such two runs near the minimum value and the maximum value of the element, the sequence of other intermediate size elements of both runs It seems that it is set because there will be a similar tendency in.

2-4. Memory for copying columns used for merge sort

When merging two columns with merge sort, reserve a copy of one of the columns in memory. Basically, in order to perform merge sort with the number of elements $ O (n) $, it is necessary to secure a memory area of $ O (n) $. Allocating that memory area is an overhead in merge sort. To reduce this overhead and perform merge sort faster, try to reduce this memory allocation area.

In TimSort, if there are two runs, ** make a copy of the run with the smaller number of elements. ** However, in the case of run, since it is a column that has already been sorted, it is possible to further reduce the allocation of memory area. How to do it?

First, pay attention to the larger of the minimum values of the two runs. Let A be the one with the larger element. If elements with smaller values are consecutively lined up in run B, which is not A, the subsequence can be left there without any tampering. (This is effective by merging consecutive memory areas as described below.) That amount can be merge-sorted or left as it is before, so it is excluded from the target. In addition, we will do the same for the larger one. This will reduce the size of the effective run that will be merge-sorted.

When searching for a continuous column like the one above, perform the "rangeless binary search" mentioned earlier.

Therefore, the memory area required for copying can be reduced ** by the number of consecutive minimum and maximum elements that are continuously copied to the merged array **. In that way, it seems that the overhead is reduced and the speed is increased.

2-4-1. Runs in adjacent areas of memory are combined

Basically, when putting a run on the stack, put it in order from the adjacent comrades in the memory area. As you can see from the merge fake algorithm above, the run to be merged will be merged next to each other in the stack. If you continue this kind of merging, the things you want to merge will be merged in adjacent areas of memory.

This is very convenient. Reference-type variables such as arrays either point to a pointer or not, without changing the memory structure itself. What's more, when they are merged, you can easily handle them as an array simply by removing the extra middle pointer and changing the size to be handled. (Because the types of the elements of the array to be sorted are the same from the beginning), you can expect that you can easily create a large-sized array, and you can easily operate the array. (Please put in a rush here.)

And even better, since run is an array that's already sorted, you can do the following conveniences:

--In the array in the area before the memory, if there is a subarray consisting of elements smaller than the beginning of the array after it, the subarray is completed as a sorted state without really doing anything. --Conversely, if there is a sub-array consisting of elements larger than the last element of the previous array in the array in the area behind the memory, the sub-array will be completed as a sorted state without really doing anything. There is.

In this way, it can be read that the sorted nature of run and the adjacent nature of memory will effectively save unnecessary operations.

Finally

I tried to explain it with a lot of sensory words and intuitive calculations so that I could get angry and understand TimSort myself. It may have been a very rigorous and eyebrowsing article for those who like accuracy. On the other hand, I'm afraid of making mistakes, and even if I can't step into the core part of building my understanding and get into a list of jargon wrapped in smoke, that's what makes me personally dissatisfied. I feel like I've become.

Some people may complain about articles that are not very accurate, but as I wrote them, it took time and effort, and more accurate expressions are available in the sequel and the second and subsequent editions. I would appreciate it if you could forgive me for the fact that I have accumulated physical strength.

In addition, it took some time to understand and get hungry, and I think there are some parts that were completely misunderstood due to lack of understanding. If you find such a point, we would appreciate it if you could point out and give us guidance.

Recommended Posts

I studied TimSort ~ Sophisticated sorting method that is implemented as standard in Java and Python ~ (Part 1)
I wrote a class in Python3 and Java
Find the part that is 575 from Wikipedia in Python
I tried programming the chi-square test in Python and Java.
I compared Java and Python!
Difference between == and is in python
Use fabric as is in python (fabric3)
Implemented label propagation method in Python
I hear that Matlab and Python loops are slow, but is that true?
Overlapping regular expressions in Python and Java
I compared python3 standard argparse and python-fire
AM modulation and demodulation in Python Part 2
Differences in syntax between Python and Java
I implemented Cousera's logistic regression in Python
How to use is and == in Python
Sorting AtCoder ARC 086 C hashes to solve in Ruby, Perl, Java and Python
[Python] I tried to solve 100 past questions that beginners and intermediates should solve [Part 5/22]
[Python] I tried to solve 100 past questions that beginners and intermediates should solve [Part7 / 22]
[Python] I tried to solve 100 past questions that beginners and intermediates should solve [Part 4/22]
Note that I understand the least squares algorithm. And I wrote it in Python.
[Python] I tried to solve 100 past questions that beginners and intermediates should solve [Part3 / 22]
[Python] I tried to solve 100 past questions that beginners and intermediates should solve [Part 1/22]
[Python] I tried to solve 100 past questions that beginners and intermediates should solve [Part 6/22]