** (* Added on May 13, 2020: When the number of elements in the list is changed) **
At ABC167 at Atcoder the other day, I fired TLE with the following problem. ・ ABC167D-Teleporter
Teleport the towns in turn, stop when the loop occurs, and the remaining number of times I realized that I should use mod to find it,
Become a loop ⇔ Reach the same town again ⇔ Already in the list of visited towns
I thought it would be good to think, and it took a long time to process using X in list
in Python.
After all, I realized that I should use X in set
instead.
I was wondering if the speed would change so much, so I summarized the results of my research.
Quoted from Python's Official Documentation
The set object is an unordered collection of unique hashable objects. Common uses include attribution tests, deduplication from sequences, intersections, unions, difference sets, and the calculation of mathematical operations such as symmetric difference (exclusive).
Sets support x in set, len (set), for x in set, just like any other collection. Since the collection has no order, the set does not record the order of insertions or the position of the elements. Therefore, sets do not support indexing, slicing, or other sequential behavior.
In addition, the features are summarized in an easy-to-understand manner in here.
- Same element contains only one </ strong>. -- list-like operations such as x in set, len (set), for x in set </ strong> --Elements can be added / deleted later (tuple cannot be changed, list can be changed, so it is closer to list) --Aggregate calculation is possible (calculations such as being included in A but not in B, being included in both A and B, etc. can be easy and fast </ strong>) --The order is not preserved, so the first one you put in is not always the first --The operation of "x in set" is super fast compared to list </ strong>
In short, it seems to be a data type that is stored in a sorted state without duplication. For this reason, binary search is always available for set type search and seems to be faster.
In order to investigate the list
type and set
type, we investigated the search in the following four cases.
number | Data type | Data order | Search method |
---|---|---|---|
1 | list |
random | Linear search |
2 | list |
Sorted | Linear search |
3 | list |
Sorted | Binary search |
4 | set |
(Sorted) | (Binary search) |
To make other conditions the same
--The number of elements in the list is $ 10 ^ 6 $ and there is no duplication. --The element $ t $ to search satisfies $ 1 \ le t \ le 10 ^ 6 $. (Absolutely found) -Randomly generate $ 10 ^ 2 $, $ 10 ^ 3 $, $ 10 ^ 4 $ and search for each.
I did it as.
main.py
import time
import random
from collections import deque
num = [i+1 for i in range(1000000)]
num_shaffle = random.sample(num, len(num))
num_set = set(num_shaffle)
#What to look for
target = [random.randint(1, 1000000) for i in range(10000)]
#list type linear search or set type
def search_seaquence(L):
each_time = deque([])
for i in range(10000):
#elapsed from start_Measure the time difference to time
start = time.time()
target[i] in L
elapsed_time = time.time() - start
each_time.append(elapsed_time)
return list(each_time)
#list type binary search
def search_half(L):
each_time = deque([])
for i in range(10000):
low = 0
high = len(num) - 1
start = time.time()
while low <= high:
mid = int((low + high) // 2)
guess = L[mid]
if guess == target[i]:
break
elif guess > target[i]:
high = mid - 1
else:
low = mid + 1
elapsed_time = time.time() - start
each_time.append(elapsed_time)
return list(each_time)
def cal_time(LIS, mode):
print(mode)
#total time
sum_time = sum(LIS)
print ("sum_time:{0}".format(sum_time) + "[sec]")
return 0
#Random list / linear search
random_list = search_seaquence(num_shaffle)
cal_time(random_list, 'mode: random list')
#Permutation list / linear search
seaquence_list_seq = search_seaquence(num)
cal_time(seaquence_list_seq, 'mode: seaquence list seaquence')
#Permutation list / binary search
seaquence_list_half = search_half(num)
cal_time(seaquence_list_half, 'mode: seaquence list half')
#set search
set_seek = search_seaquence(num_set)
cal_time(set_seek, 'mode: set')
Search method | |||
---|---|---|---|
Random list / linear search | |||
Sorted list / linear search | |||
Sorted list / binary search | |||
set type |
Somehow I try to make it into a graph.
(** Log-log graph **.)
As expected, that's right, but it's not so different depending on the order.
I think the order for the binary search was $ O (\ log n) $, but it looks quite different.
Even with the same binary search, it was surprising that the set
type is about 20 times faster with $ 10 ^ 4 $ pieces.
(Maybe it's a problem with my code)
When I think about it, I realized that the effect of the order cannot be seen unless the number of elements in the search destination (list
/ set
) for searching is changed.
In addition, the number of elements in the search destination list is set to $ 10 ^ 4, 10 ^ 5, 10 ^ 6 $ while the search target is fixed at $ 10 ^ 4 $.
I also investigated when it was changed.
Search method\Number of list elements | |||
---|---|---|---|
Random list / linear search | |||
Sorted list / linear search | |||
Sorted list / binary search | |||
set type |
(** Log-log graph **.)
It's exactly as ordered. I'm glad. It seems that the reason why the time is rather reduced by the binary search is that the search destination is decided at random. By doing this, you can see the strength of the log order.
Do not use X in list
when looking for duplicate data (commandment)
After all, binary search is fast
It seems that it can also be used when searching for specific data from a large amount of data.
However, if it is a set
type, the index information will be lost.
When I wanted that information, I found that it was necessary to keep the list
separately.
Recommended Posts