I investigated the calculation time of "X in list" (linear search / binary search) and "X in set"

** (* Added on May 13, 2020: When the number of elements in the list is changed) **

background

At ABC167 at Atcoder the other day, I fired TLE with the following problem. ・ ABC167D-Teleporter

Teleport the towns in turn, stop when the loop occurs, and the remaining number of times I realized that I should use mod to find it,

Become a loop ⇔ Reach the same town again ⇔ Already in the list of visited towns

I thought it would be good to think, and it took a long time to process using X in list in Python.

After all, I realized that I should use X in set instead. I was wondering if the speed would change so much, so I summarized the results of my research.

What is set type?

Quoted from Python's Official Documentation

The set object is an unordered collection of unique hashable objects. Common uses include attribution tests, deduplication from sequences, intersections, unions, difference sets, and the calculation of mathematical operations such as symmetric difference (exclusive).

Sets support x in set, len (set), for x in set, just like any other collection. Since the collection has no order, the set does not record the order of insertions or the position of the elements. Therefore, sets do not support indexing, slicing, or other sequential behavior.

In addition, the features are summarized in an easy-to-understand manner in here.

- Same element contains only one </ strong>. -- list-like operations such as x in set, len (set), for x in set </ strong> --Elements can be added / deleted later (tuple cannot be changed, list can be changed, so it is closer to list) --Aggregate calculation is possible (calculations such as being included in A but not in B, being included in both A and B, etc. can be easy and fast </ strong>) --The order is not preserved, so the first one you put in is not always the first --The operation of "x in set" is super fast compared to list </ strong>

In short, it seems to be a data type that is stored in a sorted state without duplication. For this reason, binary search is always available for set type search and seems to be faster.

Measurement condition

In order to investigate the list type and set type, we investigated the search in the following four cases.

number Data type Data order Search method
1 list random Linear search
2 list Sorted Linear search
3 list Sorted Binary search
4 set (Sorted) (Binary search)

To make other conditions the same

--The number of elements in the list is $ 10 ^ 6 $ and there is no duplication. --The element $ t $ to search satisfies $ 1 \ le t \ le 10 ^ 6 $. (Absolutely found) -Randomly generate $ 10 ^ 2 $, $ 10 ^ 3 $, $ 10 ^ 4 $ and search for each.

I did it as.

Source code

main.py


import time
import random
from collections import deque

num = [i+1 for i in range(1000000)]
num_shaffle = random.sample(num, len(num))
num_set = set(num_shaffle)
#What to look for
target = [random.randint(1, 1000000) for i in range(10000)]

#list type linear search or set type
def search_seaquence(L):
    each_time = deque([])

    for i in range(10000):
        #elapsed from start_Measure the time difference to time
        start = time.time()
        target[i] in L
        elapsed_time = time.time() - start
        each_time.append(elapsed_time)

    return list(each_time)

#list type binary search
def search_half(L):
    each_time = deque([])

    for i in range(10000):
        low = 0
        high = len(num) - 1
        start = time.time()
        
        while low <= high:
            mid = int((low + high) // 2)
            guess = L[mid]
            if guess == target[i]:
                break
            elif guess > target[i]:
                high = mid - 1
            else:
                low = mid + 1

        elapsed_time = time.time() - start
        each_time.append(elapsed_time)
    return list(each_time)


def cal_time(LIS, mode):
    print(mode)
    #total time
    sum_time = sum(LIS)
    print ("sum_time:{0}".format(sum_time) + "[sec]")

    return 0


#Random list / linear search
random_list = search_seaquence(num_shaffle)
cal_time(random_list, 'mode: random list')

#Permutation list / linear search
seaquence_list_seq = search_seaquence(num)
cal_time(seaquence_list_seq, 'mode: seaquence list seaquence')

#Permutation list / binary search
seaquence_list_half = search_half(num)
cal_time(seaquence_list_half, 'mode: seaquence list half')

#set search
set_seek = search_seaquence(num_set)
cal_time(set_seek, 'mode: set')

result

Search method 10^2Total(sec) 10^3Total(sec) 10^4Total(sec)
Random list / linear search 5.14 4.01 \times 10 4.65 \times 10^2
Sorted list / linear search 1.13 8.36 1.29 \times 10^2
Sorted list / binary search 3.46 \times 10^{-3} 2.03 \times 10^{-2} 1.16 \times 10^{-1}
set type 8.44 \times 10^{-5} 9.92 \times 10^{-4} 5.45 \times 10^{-3}

Somehow I try to make it into a graph. list_order.png

(** Log-log graph **.)

As expected, that's right, but it's not so different depending on the order. I think the order for the binary search was $ O (\ log n) $, but it looks quite different. Even with the same binary search, it was surprising that the set type is about 20 times faster with $ 10 ^ 4 $ pieces. (Maybe it's a problem with my code)

[Addition] When the number of elements in the search destination is changed

When I think about it, I realized that the effect of the order cannot be seen unless the number of elements in the search destination (list / set) for searching is changed. In addition, the number of elements in the search destination list is set to $ 10 ^ 4, 10 ^ 5, 10 ^ 6 $ while the search target is fixed at $ 10 ^ 4 $. I also investigated when it was changed.

Search method\Number of list elements 10^4Total(sec) 10^5Total(sec) 10^6Total(sec)
Random list / linear search 1.68 2.58 \times 10 5.37 \times 10^2
Sorted list / linear search 9.26 \times 10^{-1} 1.29 \times 10 1.34 \times 10^2
Sorted list / binary search 8.93 \times 10^{-2} 1.75 \times 10^{-1} 1.76 \times 10^{-1}
set type 5.62 \times 10^{-3} 4.02 \times 10^{-3} 1.01 \times 10^{-2}

image.png

(** Log-log graph **.)

It's exactly as ordered. I'm glad. It seems that the reason why the time is rather reduced by the binary search is that the search destination is decided at random. By doing this, you can see the strength of the log order.

Summary

Do not use X in list when looking for duplicate data (commandment) After all, binary search is fast

It seems that it can also be used when searching for specific data from a large amount of data. However, if it is a set type, the index information will be lost. When I wanted that information, I found that it was necessary to keep the list separately.

Recommended Posts