** (* Added on May 13, 2020: When the number of elements in the list is changed) **

background

At ABC167 at Atcoder the other day, I fired TLE with the following problem. ・ ABC167D-Teleporter

Teleport the towns in turn, stop when the loop occurs, and the remaining number of times I realized that I should use mod to find it,

Become a loop ⇔ Reach the same town again ⇔ Already in the list of visited towns

I thought it would be good to think, and it took a long time to process using X in list in Python.

After all, I realized that I should use X in set instead. I was wondering if the speed would change so much, so I summarized the results of my research.

What is set type?

Quoted from Python's Official Documentation

The set object is an unordered collection of unique hashable objects. Common uses include attribution tests, deduplication from sequences, intersections, unions, difference sets, and the calculation of mathematical operations such as symmetric difference (exclusive).

Sets support x in set, len (set), for x in set, just like any other collection. Since the collection has no order, the set does not record the order of insertions or the position of the elements. Therefore, sets do not support indexing, slicing, or other sequential behavior.

In addition, the features are summarized in an easy-to-understand manner in here.

- Same element contains only one </ strong>. -- list-like operations such as x in set, len (set), for x in set </ strong> --Elements can be added / deleted later (tuple cannot be changed, list can be changed, so it is closer to list) --Aggregate calculation is possible (calculations such as being included in A but not in B, being included in both A and B, etc. can be easy and fast </ strong>) --The order is not preserved, so the first one you put in is not always the first --The operation of "x in set" is super fast compared to list </ strong>

In short, it seems to be a data type that is stored in a sorted state without duplication. For this reason, binary search is always available for set type search and seems to be faster.

Measurement condition

In order to investigate the list type and set type, we investigated the search in the following four cases.

number Data type Data order Search method

1 list random Linear search

2 list Sorted Linear search

3 list Sorted Binary search

4 set (Sorted) (Binary search)

To make other conditions the same

--The number of elements in the list is $ 10 ^ 6 $ and there is no duplication. --The element $ t $ to search satisfies $ 1 \ le t \ le 10 ^ 6 $. (Absolutely found) -Randomly generate $ 10 ^ 2 $, $ 10 ^ 3 $, $ 10 ^ 4 $ and search for each.

I did it as.

Source code

main.py

import time import random from collections import deque num = [i+1 for i in range(1000000)] num_shaffle = random.sample(num, len(num)) num_set = set(num_shaffle) #What to look for target = [random.randint(1, 1000000) for i in range(10000)] #list type linear search or set type def search_seaquence(L): each_time = deque([]) for i in range(10000): #elapsed from start_Measure the time difference to time start = time.time() target[i] in L elapsed_time = time.time() - start each_time.append(elapsed_time) return list(each_time) #list type binary search def search_half(L): each_time = deque([]) for i in range(10000): low = 0 high = len(num) - 1 start = time.time() while low <= high: mid = int((low + high) // 2) guess = L[mid] if guess == target[i]: break elif guess > target[i]: high = mid - 1 else: low = mid + 1 elapsed_time = time.time() - start each_time.append(elapsed_time) return list(each_time) def cal_time(LIS, mode): print(mode) #total time sum_time = sum(LIS) print ("sum_time:{0}".format(sum_time) + "[sec]") return 0 #Random list / linear search random_list = search_seaquence(num_shaffle) cal_time(random_list, 'mode: random list') #Permutation list / linear search seaquence_list_seq = search_seaquence(num) cal_time(seaquence_list_seq, 'mode: seaquence list seaquence') #Permutation list / binary search seaquence_list_half = search_half(num) cal_time(seaquence_list_half, 'mode: seaquence list half') #set search set_seek = search_seaquence(num_set) cal_time(set_seek, 'mode: set')

result

Search method 10^2Total(sec) 10^3Total(sec) 10^4Total(sec)

Random list / linear search 5.14 4.01 \times 10 4.65 \times 10^2

Sorted list / linear search 1.13 8.36 1.29 \times 10^2

Sorted list / binary search 3.46 \times 10^{-3} 2.03 \times 10^{-2} 1.16 \times 10^{-1}

set type 8.44 \times 10^{-5} 9.92 \times 10^{-4} 5.45 \times 10^{-3}

Somehow I try to make it into a graph.

(** Log-log graph **.)

As expected, that's right, but it's not so different depending on the order. I think the order for the binary search was $ O (\ log n) $, but it looks quite different. Even with the same binary search, it was surprising that the set type is about 20 times faster with $ 10 ^ 4 $ pieces. (Maybe it's a problem with my code)

[Addition] When the number of elements in the search destination is changed

When I think about it, I realized that the effect of the order cannot be seen unless the number of elements in the search destination (list / set) for searching is changed. In addition, the number of elements in the search destination list is set to $ 10 ^ 4, 10 ^ 5, 10 ^ 6 $ while the search target is fixed at $ 10 ^ 4 $. I also investigated when it was changed.

Search method\Number of list elements 10^4Total(sec) 10^5Total(sec) 10^6Total(sec)

Random list / linear search 1.68 2.58 \times 10 5.37 \times 10^2

Sorted list / linear search 9.26 \times 10^{-1} 1.29 \times 10 1.34 \times 10^2

Sorted list / binary search 8.93 \times 10^{-2} 1.75 \times 10^{-1} 1.76 \times 10^{-1}

set type 5.62 \times 10^{-3} 4.02 \times 10^{-3} 1.01 \times 10^{-2}

(** Log-log graph **.)

It's exactly as ordered. I'm glad. It seems that the reason why the time is rather reduced by the binary search is that the search destination is decided at random. By doing this, you can see the strength of the log order.

Summary

Do not use X in list when looking for duplicate data (commandment) After all, binary search is fast

It seems that it can also be used when searching for specific data from a large amount of data. However, if it is a set type, the index information will be lost. When I wanted that information, I found that it was necessary to keep the list separately.

Recommended Posts
I investigated the calculation time of "X in list" (linear search / binary search) and "X in set"

I compared the calculation time of the moving average written in Python

Search by the value of the instance in the list

I checked the reference speed when using python list, dictionary, and set type in.

I compared the speed of the reference of the python in list and the reference of the dictionary comprehension made from the in list.

I tried to illustrate the time and time in C language

I investigated the behavior of the difference between hard links and symbolic links

List of Linear Programming (LP) solvers and modelers available in Python

Verify the compression rate and time of PIXZ used in practice

I replaced the numerical calculation of Python with Rust and compared the speed

I measured the speed of list comprehension, for and while with python2.7.

In search of the fastest FizzBuzz in Python

What I investigated in the process of expressing (schematicizing) containers in a nested frame with Jupyter and making it

I can't use the "next_results" parameter in the Twitter API Search API! ?? Causes and remedies

plot the coordinates of the processing (python) list and specify the number of times in draw ()

I installed Pygame with Python 3.5.1 in the environment of pyenv on OS X

Talking about the features that pandas and I were in charge of in the project

I want to set a life cycle in the task definition of ECS

I want to see a list of WebDAV files in the Requests module

I want to store the result of% time, %% time, etc. in an object (variable)

[Python] Sort the list of pathlib.Path in natural sort

Make a copy of the list in Python

I checked the list of shortcut keys of Jupyter

[At Coder] Solve the problem of binary search

I read and implemented the Variants of UKR

[Linux] I learned LPIC lv1 in 10 days and tried to understand the mechanism of Linux.

I just wanted to extract the data of the desired date and time with Django

Check the processing time and the number of calls for each process in python (cProfile)

It's time to seriously think about the definition and skill set of data scientists

A program struggle to process a floating-point list in parallel on four clusters of Raspberry Pi 4B via Python socket communication and reduce the loop calculation time to 1/4.

"Linear regression" and "Probabilistic version of linear regression" in Python "Bayesian linear regression"

Experience the good calculation efficiency of vectorization in Python

Extract and list personal names and place names in the text

Calculation of standard deviation and correlation coefficient in Python

In search of the best random dot stereogram (RDS).

[python] Get the list of classes defined in the module

[python] Calculation of months and years of difference in datetime

I checked the default OS and shell of docker-machine

[Python] Outputs all combinations of elements in the list

Overview of generalized linear models and implementation in Python

I investigated the reinforcement learning algorithm of algorithmic trading

[Golang] Command to check the supported GOOS and GOARCH in a list (Check the supported platforms of the build)

I tried to automatically post to ChatWork at the time of deployment with fabric and ChatWork Api

The story of creating a "spirit and time chat room" exclusively for engineers in the company

[Scientific / technical calculation by Python] List of matrices that appear in Hinpan in numerical linear algebra

Python3 I don't know the balanced binary search tree, but I wish I had a sorted set.

number	Data type	Data order	Search method
1	`list`	random	Linear search
2	`list`	Sorted	Linear search
3	`list`	Sorted	Binary search
4	`set`	(Sorted)	(Binary search)

Search method	10^2Total(sec)	10^3Total(sec)	10^4Total(sec)
Random list / linear search	5.14	4.01 \times 10	4.65 \times 10^2
Sorted list / linear search	1.13	8.36	1.29 \times 10^2
Sorted list / binary search	3.46 \times 10^{-3}	2.03 \times 10^{-2}	1.16 \times 10^{-1}
set type	8.44 \times 10^{-5}	9.92 \times 10^{-4}	5.45 \times 10^{-3}

Search method\Number of list elements	10^4Total(sec)	10^5Total(sec)	10^6Total(sec)
Random list / linear search	1.68	2.58 \times 10	5.37 \times 10^2
Sorted list / linear search	9.26 \times 10^{-1}	1.29 \times 10	1.34 \times 10^2
Sorted list / binary search	8.93 \times 10^{-2}	1.75 \times 10^{-1}	1.76 \times 10^{-1}
set type	5.62 \times 10^{-3}	4.02 \times 10^{-3}	1.01 \times 10^{-2}