This article is the 18th day of Tabelog Advent Calendar 2020.
Hello, this is @an_sony of eating log Web engineer. I usually develop a reservation service system.
Recently, the GoToEat online business has increased the number of users of the reservation service (Thank you for using it!), And the number of performance tunings has increased.
As one of the tuning methods, I often used ** to speed up the element search process of the list by using Set # include?
Instead of Array # include?
**, but at one point, "that , Array and Set look similar, but why is this process so fast? "
Can you answer this reason?
It would be nice to look at the contents (implementation) of the Set class to clarify, but by the way, I have never seen the contents of Ruby or Rails. .. So I think it's a good opportunity, and I'll try to clarify it by looking at the contents.
↓ People who want to read
・ People who do not know or understand the Set class ・ Ruby, Rails beginner to intermediate
A class that handles "sets". A set is a collection of unique objects. There is no particular ordering relationship between the elements.
You can create a set by passing an array. At this time, the duplicate element is deleted.
[1] pry(main)> s = Set.new([1, 2, 1])
=> #<Set: {1, 2}>
Set # include?
Uses Array # include?
in the same way.
[2] pry(main)> s.include?(1)
=> true
[3] pry(main)> s.include?(3)
=> false
Set # include?
Is faster than Array # include?
, Especially when the number of elements is large. (Measurement comparison is not performed here), so it is very useful when handling a large amount of data such as batch processing.
Looking at the official reference, there was an explanation like this.
It has both the arithmetic function of Array and the high-speed search function of Hash.
Use Hash as internal memory.
It's a class that takes advantage of Array and Hash. And it seems that it is Hash's search function that realizes the speedup.
What it means is that it is realized by the "hash method" which is one of the data search algorithms. The hash method is a method that allows you to search for specific data in one shot by determining the storage position of the data when storing it in memory. The storage position is calculated by a fixed formula (hash function). * To be exact, Ruby uses the chain method. This makes it possible to search for a large amount of data at high speed.
Let's take a look at the contents of the Set class.
First of all, this is the content when Set.new
is done ↓
# Creates a new set containing the elements of the given enumerable
# object.
#
# If a block is given, the elements of enum are preprocessed by the
# given block.
#
# Set.new([1, 2]) #=> #<Set: {1, 2}>
# Set.new([1, 2, 1]) #=> #<Set: {1, 2}>
# Set.new([1, 'c', :s]) #=> #<Set: {1, "c", :s}>
# Set.new(1..5) #=> #<Set: {1, 2, 3, 4, 5}>
# Set.new([1, 2, 3]) { |x| x * x } #=> #<Set: {1, 4, 9}>
def initialize(enum = nil, &block) # :yields: o
@hash ||= Hash.new(false)
enum.nil? and return
if block
do_with_enum(enum) { |o| add(block[o]) }
else
merge(enum)
end
end
https://github.com/ruby/ruby/blob/master/lib/set.rb#L90-L100
Suddenly Hash.new
.
If you do Set.new ([1, 2, 1])
, you are calling the merge
method because there is no block.
Let's take a look at the merge
method.
# Merges the elements of the given enumerable object to the set and
# returns self.
def merge(enum)
if enum.instance_of?(self.class)
@hash.update(enum.instance_variable_get(:@hash))
else
do_with_enum(enum) { |o| add(o) }
end
self
end
https://github.com/ruby/ruby/blob/master/lib/set.rb#L438-L448
In the case of Set.new ([1, 2, 1])
, it seems that you are calling add
.
What add
is doing is
# Adds the given object to the set and returns self. Use +merge+ to
# add many elements at once.
#
# Set[1, 2].add(3) #=> #<Set: {1, 2, 3}>
# Set[1, 2].add([3, 4]) #=> #<Set: {1, 2, [3, 4]}>
# Set[1, 2].add(2) #=> #<Set: {1, 2}>
def add(o)
@hash[o] = true
self
end
alias << add
https://github.com/ruby/ruby/blob/master/lib/set.rb#L350-L360
The key of @hash is set.
That is, when you run Set.new ([1, 2, 1])
internally
@hash = {1=>true, 2=>true}
It seems that a Hash instance variable is created like this. So that's it~
So what is Set # include?
doing? (I can predict it ...)
# Returns true if the set contains the given object.
#
# Note that <code>include?</code> and <code>member?</code> do not test member
# equality using <code>==</code> as do other Enumerables.
#
# See also Enumerable#include?
def include?(o)
@hash[o]
end
alias member? include?
https://github.com/ruby/ruby/blob/master/lib/set.rb#L242-L251
Yes, it is judged whether the argument o exists in the key of @hash. Well, that's right.
In other words, the contents of Set # include?
-Create a Hash type instance variable and set a value in key
-Determine whether the value of the argument of include? Exists in key
What you are doing is the same as Hash # include?
.
What is the computational complexity order? Please refer to here for people ↓ Reference: Computational complexity order
Method | Computational complexity order |
---|---|
Set#include?, Hash#Include? | O(logn) |
Array#Include? | O(n) |
Set # include?
And Hash # include?
Are O (logn). Therefore, even if the number of elements increases, the processing does not slow down easily.
On the other hand, Array # include? Since the amount of calculation is O (n), the processing becomes slower as the number of elements increases.
Let's see the contents of Array # include?
And clarify the reason why it is slow. (It's C language ...)
/*
* call-seq:
* array.include?(obj) -> true or false
*
* Returns +true+ if for some index +i+ in +self+, <tt>obj == self[i]</tt>;
* otherwise +false+:
* [0, 1, 2].include?(2) # => true
* [0, 1, 2].include?(3) # => false
*/
VALUE
rb_ary_includes(VALUE ary, VALUE item)
{
long i;
VALUE e;
for (i=0; i<RARRAY_LEN(ary); i++) {
e = RARRAY_AREF(ary, i);
if (rb_equal(e, item)) {
return Qtrue;
}
}
return Qfalse;
}
https://github.com/ruby/ruby/blob/master/array.c#L5102-L5125
Only the length of ary
is looped by the for statement.
Therefore, if the number of ary
elements is large, the loop processing will be longer accordingly.
That's why it's O (n).
Hash # include?
Has also evolved recentlyI was also investigating the Hash class, but when Ruby 2.6 introduced a technology called Transient Heap, I found an article that devised a Hash search method.
[Transient Heap device: Hash implementation](https://techlife.cookpad.com/entry/2018/12/27/093914#Transient-Heap-%E3%81%AE%E5%B7%A5%E5% A4% ABHash-% E3% 81% AE% E5% AE% 9F% E8% A3% 85 )
Although it is another language, Python also has a Set type, which seems to speed up element search compared to the List type. This speedup has the same mechanism and is realized by the hash method.
[What made it explosive just by changing from "in list" to "in set" in Python and why] (https://qiita.com/kitadakyou/items/6f877edd263f097e78f4)
Now we have a clear idea of how Set # include?
Works and why it's so fast.
When I usually write code, I think that I don't pay much attention to the contents of classes and methods, but knowing how it works helps me understand the underlying technology, so when I have time, I use Ruby. We recommend that you take a look inside the Rails.
Tomorrow is @ saratani's "A story about how Asana made daily MTG a little easier". looking forward to!
https://docs.ruby-lang.org/ja/latest/class/Set.html https://i.loveruby.net/ja/rhg/book/name.html https://techlife.cookpad.com/entry/2018/12/27/093914 https://qiita.com/kitadakyou/items/6f877edd263f097e78f4
Recommended Posts