Sorting algorithm and implementation in Python

Introduction

I've summarized some sorting algorithms.

I described each point, implementation example in Python, and measurement of each speed.

I implemented it without using built-in functions as much as possible, so it's not very fast, but I wanted to make something faster than sort () and sorted ().

Sort algorithm to handle

The sorting algorithm to handle is

--Bubble sort --Select sort --Insert sort --Merge sort --Quick sort --Count sort

name Average calculation time Worst calculation time memory usage Stability
Bubble sort O(n^2) O(n^2) O(1) Stable
Selection sort O(n^2) O(n^2) O(1) Not stable
Insertion sort O(n^2) O(n^2) O(1) Stable
Merge sort O(n\log n) O(n\log n) O(n) Stable
Quick sort O(n\log n) O(n^2) O(\log n) Not stable
Count sort O(nk) O(n^2 k) O(nk) Not stable

Other well-known sorting algorithms include heapsort, shellsort, and radix sort, but this time I have forgotten them. I want to add it someday.

For heapsort, it's pretty easy to implement with Python's heapq module, but it feels different from doing sort () ...

Stability means that when data such as [1-b, 2-a, 1-a, 2-b] is sorted by number as a key, the order before sorting is saved, and [1-b, We recognize that something like 1-a, 2-a, 2-b] is defined as stable.

Each point and implementation

Bubble sort

--Compare and exchange the size of each side, and do it until there is no need to change. --Simple and easy to understand.

bubble_sort.py


def bubble_sort(arr):
    change = True
    while change:
        change = False
        for i in range(len(arr) - 1):
            if arr[i] > arr[i + 1]:
                arr[i], arr[i + 1] = arr[i + 1], arr[i]
                change = True
    return arr

Selection sort

--Select the minimum or maximum in the array and move it to the end. --Simple and easy to understand.

I feel like I can use min () ...

In order to reduce the memory usage to $ O (1) $, we implement it by replacing instead of putting it in a new list.

sellect_sort.py


def sellect_sort(arr):
    for ind, ele in enumerate(arr):
        min_ind = min(range(ind, len(arr)), key=arr.__getitem__)
        arr[ind], arr[min_ind] = arr[min_ind], ele
    return arr

Insertion sort

--In order from the left, insert into the appropriate part of the arranged array. --When searching for a place to insert, you can speed up by using a 2-minute search. --Compared to merge sort and quick sort, it does not depend on the state of data. --Strong against almost sorted arrays and relatively small arrays. --Simple and easy to understand.

If you do not use the 2-minute search,

insert_sort.py


def insert_sort(arr):
    for i in range(1, len(arr)):
        j = i - 1
        ele = arr[i]
        while arr[j] > ele and j >= 0:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = ele
    return arr

When using a 2-minute search,

insert_sort_bin.py


import sys
sys.setrecursionlimit(10 ** 7)

def binary_search(arr, low, hig, ele):
    if low == hig:
        if arr[low] > ele:
            return low
        else:
            return low + 1
    elif low > hig:
        return low
    
    mid = (low + hig) // 2
    if arr[mid] < ele:
        return binary_search(arr, mid + 1, hig, ele)
    elif arr[mid] > ele:
        return binary_search(arr, low, mid - 1, ele)
    else:
        return mid

def insert_sort_bin(arr):
    for i in range(1, len(arr)):
        ele = arr[i]
        ind = binary_search(arr, 0, i - 1, ele)
        arr[:] = arr[:ind] + [ele] + arr[ind:i] + arr[i + 1:]
    return arr

By the way, when using bisect, which is a module for 2-minute search,

insert_sort_bin_buildin.py


import bisect
def insert_sort_bin_buildin(arr):
    for i in range(1, len(arr)):
        bisect.insort(arr, arr.pop(i), 0, i)
    return arr

Merge sort

--First, repeat the division, and then combine in order. --As a result, splitting and joining are done recursively. --Sort in the process of combining. ――Because the sorted arrays are joined together, devise a joining method. --Heapq merge can also be used for merging. --If you can divide it to a certain extent, you can speed it up by using another sorting algorithm. ――It's pretty fast, but it uses a lot of memory. --Can be processed in parallel

merge_sort.py


def merge_sort(arr):
    if len(arr) <= 1:
        return arr
    
    mid = len(arr) // 2
    #Do the split here
    left = arr[:mid]
    right = arr[mid:]
    
    #Recursively split
    left = merge_sort(left)
    right = merge_sort(right)
    
    #When return is returned, combine and pass the combined next
    return merge(left, right)


def merge(left, right):
    merged = []
    l_i, r_i = 0, 0
    
    #To merge the sorted arrays, just look at each from the left
    while l_i < len(left) and r_i < len(right):
        #here=The stability is maintained by attaching
        if left[l_i] <= right[r_i]:
            merged.append(left[l_i])
            l_i += 1
        else:
            merged.append(right[r_i])
            r_i += 1
    
    #If either of the above while statements becomes False, it will end, so extend too much
    if l_i < len(left):
        merged.extend(left[l_i:])
    if r_i < len(right):
        merged.extend(right[r_i:])
    return merged

Quick sort

――Determine a reference value and divide it into two arrays, large and small according to the reference value. --As a method of selecting this reference value, use the beginning and end of the array, the median value of about three appropriate values, and so on. --The same processing is performed recursively within the divided groups. --Performance depends on how to determine the standard value. --If you can divide it to a certain extent, you can speed it up by using another sorting algorithm. ――It's pretty fast, but it uses a lot of memory. --Not a stable sort.

quick_sort.py


def quick_sort(arr):
    left = []
    right = []
    if len(arr) <= 1:
        return arr
    
    #Random because it does not depend on the state of the data.choice()May be used.
    # ref = random.choice(arr)
    ref = arr[0]
    ref_count = 0
    
    for ele in arr:
        if ele < ref:
            left.append(ele)
        elif ele > ref:
            right.append(ele)
        else:
            ref_count += 1
    left = quick_sort(left)
    right = quick_sort(right)
    return left + [ref] * ref_count + right

Count sort

--Count the number of each value. --Easy to implement. --Very fast depending on the range of values.

Bucket sort is a sorting algorithm that is very similar to count sorting. Bucket sort is like a sorting algorithm that creates several boxes with a certain range, puts values in them, sorts each box, and then combines them.

Count sorting can also maintain stability by creating a box for each value and adding it to it, but this is the implementation because the idea is a bucket rather than a count. If you use the collections.defaultdict module, you can implement it quite easily and stably.

count_sort.py


def count_sort(arr):
    max_num = max(arr)
    min_num = min(arr)
    count = [0] * (max_num - min_num + 1)
    for ele in arr:
        count[ele - min_num] += 1
 
    return [ele for ele, cnt in enumerate(count, start=min_num) for __ in range(cnt)]

Speed comparison

Data set to use

Generated with numpy.random.

The created dataset is

--data1: 100 uniform random numbers of integers in the range 0-100 --data2: 10000 integer uniform random numbers in the range 0-1000 --data3: 10000 uniform random numbers of integers in the range 0-100 --data4: 100 random numbers of standard normal distribution with mean 10 and standard deviation 5 --data5: 10000 random numbers with standard normal distribution in the range of mean 100 and standard deviation 50 --data6: 100 uniform random numbers in the range 0-100 already sorted in reverse order --data7: Already sorted by 100 uniform random numbers of integers in the range 0-100, only the first 10 are messed up

make_data.py


data_rand_1 = list(randint(0, 10 ** 2, 10 ** 2))
data_rand_2 = list(randint(0, 10 ** 4, 10 ** 4))
data_rand_3 = list(randint(0, 10 ** 2, 10 ** 4))
data_rand_4 = [int(normal(10, 5)) for __ in range(10 ** 2)]
data_rand_5 = [int(normal(10 ** 2, 50)) for __ in range(10 ** 4)]
data_rand_6 = sorted(randint(0, 10 ** 2, 10 ** 2), reverse=True)
data_rand_7 = sorted(randint(0, 10 ** 2, 10 ** 2))
data_rand_7[0: 10] = choice(data_rand_7[0: 10], 10, replace=False)

Unknown.png

Unknown-1.png

Unknown-2.png

Unknown-3.png

Unknown-4.png

Unknown-5.png

Unknown-6.png

Speed measurement

The following code was used to measure 100 times and the average was calculated. I couldn't measure it well with the timeit command of Jupyter notebook.

The unit is μs.

Measurements were made for each algorithm in order to reduce the error as much as possible. Perhaps it depends on the specifications of the CPU and memory, but if the implementation was such that all algorithms could be measured at once using a loop, the performance would drop significantly.

time.py


datas = [data_rand_1, data_rand_2, data_rand_3,
             data_rand_4, data_rand_5, data_rand_6, data_rand_7]

for ind, data in enumerate(datas):
    print("---- DataSet : {0} ----".format(ind + 1))
    times = []
    for i in range(100):
        start = time.time()
        #Sort algorithm you want to measure
        sorted(data)
        elapsed_time = int((time.time() - start) * (10 ** 6))
        times.append(elapsed_time)

    print(sum(times) // 100)
type Uniform random number 大きいUniform random number 狭くて大きいUniform random number normal distribution 大きいnormal distribution Sorted in reverse order Almost sorted
Python built-in sort 16 3488 2938 10 2447 12 6
Bubble sort 25 167789 163781 21 154426 26 10
Selection sort 418 4025181 3889431 379 3552541 433 422
Insertion sort 24 58281 53705 19 47764 25 17
Insertion sort(2 minutes search) 627 1382480 1338027 334 1278530 347 339
Merge sort 593 67221 61657 329 59483 270 233
Quick sort 142 25911 12009 56 12044 469 542
Count sort 95 5597 2164 24 1607 57 51

Consideration of the result for the time being

--Python's built-in sorting is fast. --Insert sort is relatively fast among low-grade sorts. --It doesn't get faster even if you search for 2 minutes by insertion sort. --Merge sort and quick sort are unexpectedly slow. With larger data, I think it will be much faster if you put a count sort in the middle of quicksort. ――Most algorithms do not deteriorate the result depending on the data state, but the performance of only quick sort deteriorates depending on the data state.

Seeking an algorithm faster than built-in sorting

For narrow ranges, count sort seems to be faster than built-in sort.

I think that if we can implement bucket sort, which is a relative of count sort, we can build a faster algorithm.

For the time being, combine quick sort and count sort to put in a large amount of data.

The prepared data is big_data = list (randint (0, 10 ** 3, 10 ** 6)).

quick_sort_with_count.py


def quick_sort_with_count(arr):
    left = []
    right = []
    if len(arr) <= 1:
        return arr
    ref = choice(arr)
    ref_count = 0

    for ele in arr:
        if ele < ref:
            left.append(ele)
        elif ele > ref:
            right.append(ele)
        else:
            ref_count += 1

    if len(left) <= 1000:
        left = count_sort(left)
    else:
        left = quick_sort_with_count(left)

   if len(right) <= 1000:
        right = count_sort(right)
    else:
        right = quick_sort_with_count(right)

    return left + [ref] * ref_count + right
type result
Python built-in sort 464109
Quick sort 2069006
Count sort 250043
Quick sort+Count sort 3479010

In conclusion

I've run out of time for the time being, so that's it. If you have the time, take a fine-grained dataset and compare and validate Python's built-in sorts and count sorts.

Merge sort, quick sort, etc. are generally said to be fast, but for the time being, it seems that it is not so fast because it takes time for recursive calls in Python.

If you know the range and type of values, count sorting may be useful. Japanese Wikipedia doesn't even have an article, and some English documents are confused with bucket sort, but ...

Reference page

-Wikipedia Sort: A comprehensive description of the sort algorithm. -Rosetta Code: A collection of implementation examples collected for each algorithm language. I tried to drop it from here. -Random number generation summary by Numpy -Measure and display the processing time

Recommended Posts

Sorting algorithm and implementation in Python
Implementation of original sorting in Python
Implementation module "deque" in queue and Python
RNN implementation in python
Basic sorting in Python
ValueObject implementation in Python
Genetic algorithm in python
Algorithm in Python (Bellman-Ford)
SVM implementation in python
Algorithm in Python (Dijkstra's algorithm)
Explanation of edit distance and implementation in Python
Algorithm in Python (primality test)
Techniques for sorting in Python
Reproduce Euclidean algorithm in Python
Algorithm in Python (binary search)
Stack and Queue in Python
Neural network implementation in python
Implement Dijkstra's Algorithm in python
Unittest and CI in Python
Maxout description and implementation (Python)
Implementation of quicksort in Python
Merge sort implementation / complexity analysis and experiment in Python
Algorithm learned with Python 18th: Sorting (stack and queue)
Logical symbols learned in marriage (and implementation examples in Python)
Overview of generalized linear models and implementation in Python
Algorithm in Python (breadth-first search, bfs)
Difference between list () and [] in Python
Explanation and implementation of ESIM algorithm
Difference between == and is in python
HMM parameter estimation implementation in python
Mixed normal distribution implementation in python
Manipulate files and folders in Python
About dtypes in Python and Cython
Python algorithm
Write A * (A-star) algorithm in Python
Develop an investment algorithm in Python 2
Assignments and changes in Python objects
Implementation of life game in Python
Algorithm in Python (depth-first search, dfs)
Check and move directories in Python
Ciphertext in Python: IND-CCA2 and RSA-OAEP
Hashing data in R and Python
Function synthesis and application in Python
Implementing a simple algorithm in Python 2
Export and output files in Python
Implementation of Dijkstra's algorithm with python
Algorithm (segment tree) in Python (practice)
Reverse Hiragana and Katakana in Python2.7
Run a simple algorithm in Python
Reading and writing text in Python
[GUI in Python] PyQt5-Menu and Toolbar-
Create and read messagepacks in Python
Solving AOJ's Algorithm and Introduction to Data Structures in Python -Part1-
Solving AOJ's Algorithm and Introduction to Data Structures in Python -Part2-
Solving AOJ's Algorithm and Introduction to Data Structures in Python -Part4-
Solving AOJ's Algorithm and Introduction to Data Structures in Python -Part3-
Overlapping regular expressions in Python and Java
PRML Chapter 8 Product Sum Algorithm Python Implementation
Differences in authenticity between Python and JavaScript
Notes using cChardet and python3-chardet in Python 3.3.1.
Modules and packages in Python are "namespaces"