[Ruby on Rails] Understand why Set # include? Is so fast

Introduction

This article is the 18th day of Tabelog Advent Calendar 2020.

Hello, this is @an_sony of eating log Web engineer. I usually develop a reservation service system.

Recently, the GoToEat online business has increased the number of users of the reservation service (Thank you for using it!), And the number of performance tunings has increased.

As one of the tuning methods, I often used ** to speed up the element search process of the list by using Set # include? Instead of Array # include? **, but at one point, "that , Array and Set look similar, but why is this process so fast? "

Can you answer this reason?

It would be nice to look at the contents (implementation) of the Set class to clarify, but by the way, I have never seen the contents of Ruby or Rails. .. So I think it's a good opportunity, and I'll try to clarify it by looking at the contents.

↓ People who want to read

・ People who do not know or understand the Set class ・ Ruby, Rails beginner to intermediate

What is the Set class in the first place?

A class that handles "sets". A set is a collection of unique objects. There is no particular ordering relationship between the elements.

You can create a set by passing an array. At this time, the duplicate element is deleted.

[1] pry(main)> s = Set.new([1, 2, 1])
=> #<Set: {1, 2}>

Set # include? Uses Array # include? in the same way.

[2] pry(main)> s.include?(1)
=> true
[3] pry(main)> s.include?(3)
=> false

When do you use it?

Set # include? Is faster than Array # include?, Especially when the number of elements is large. (Measurement comparison is not performed here), so it is very useful when handling a large amount of data such as batch processing.

So why is it so fast?

Looking at the official reference, there was an explanation like this.

It has both the arithmetic function of Array and the high-speed search function of Hash.
Use Hash as internal memory.

It's a class that takes advantage of Array and Hash. And it seems that it is Hash's search function that realizes the speedup.

What it means is that it is realized by the "hash method" which is one of the data search algorithms. The hash method is a method that allows you to search for specific data in one shot by determining the storage position of the data when storing it in memory. The storage position is calculated by a fixed formula (hash function). * To be exact, Ruby uses the chain method. This makes it possible to search for a large amount of data at high speed.

Take a look at the contents

Let's take a look at the contents of the Set class. First of all, this is the content when Set.new is done ↓

# Creates a new set containing the elements of the given enumerable
# object.
#
# If a block is given, the elements of enum are preprocessed by the
# given block.
#
#     Set.new([1, 2])                       #=> #<Set: {1, 2}>
#     Set.new([1, 2, 1])                    #=> #<Set: {1, 2}>
#     Set.new([1, 'c', :s])                 #=> #<Set: {1, "c", :s}>
#     Set.new(1..5)                         #=> #<Set: {1, 2, 3, 4, 5}>
#     Set.new([1, 2, 3]) { |x| x * x }      #=> #<Set: {1, 4, 9}>
def initialize(enum = nil, &block) # :yields: o
  @hash ||= Hash.new(false)

  enum.nil? and return

  if block
    do_with_enum(enum) { |o| add(block[o]) }
  else
    merge(enum)
  end
end

https://github.com/ruby/ruby/blob/master/lib/set.rb#L90-L100

Suddenly Hash.new. If you do Set.new ([1, 2, 1]), you are calling the merge method because there is no block. Let's take a look at the merge method.

# Merges the elements of the given enumerable object to the set and
# returns self.
def merge(enum)
  if enum.instance_of?(self.class)
    @hash.update(enum.instance_variable_get(:@hash))
  else
    do_with_enum(enum) { |o| add(o) }
  end

  self
end

https://github.com/ruby/ruby/blob/master/lib/set.rb#L438-L448

In the case of Set.new ([1, 2, 1]), it seems that you are calling add. What add is doing is

# Adds the given object to the set and returns self.  Use +merge+ to
# add many elements at once.
#
#     Set[1, 2].add(3)                    #=> #<Set: {1, 2, 3}>
#     Set[1, 2].add([3, 4])               #=> #<Set: {1, 2, [3, 4]}>
#     Set[1, 2].add(2)                    #=> #<Set: {1, 2}>
def add(o)
  @hash[o] = true
  self
end
alias << add

https://github.com/ruby/ruby/blob/master/lib/set.rb#L350-L360

The key of @hash is set.

That is, when you run Set.new ([1, 2, 1]) internally

@hash = {1=>true, 2=>true}

It seems that a Hash instance variable is created like this. So that's it~

So what is Set # include? doing? (I can predict it ...)

# Returns true if the set contains the given object.
#
# Note that <code>include?</code> and <code>member?</code> do not test member
# equality using <code>==</code> as do other Enumerables.
#
# See also Enumerable#include?
def include?(o)
  @hash[o]
end
alias member? include?

https://github.com/ruby/ruby/blob/master/lib/set.rb#L242-L251

Yes, it is judged whether the argument o exists in the key of @hash. Well, that's right.

In other words, the contents of Set # include? -Create a Hash type instance variable and set a value in key -Determine whether the value of the argument of include? Exists in key

What you are doing is the same as Hash # include?.

Method	Computational complexity order
Set#include?, Hash#Include?	O(logn)
Array#Include?	O(n)

Further investigation revealed

`Hash # include?` Has also evolved recently

I was also investigating the Hash class, but when Ruby 2.6 introduced a technology called Transient Heap, I found an article that devised a Hash search method.

[Transient Heap device: Hash implementation](https://techlife.cookpad.com/entry/2018/12/27/093914#Transient-Heap-%E3%81%AE%E5%B7%A5%E5% A4% ABHash-% E3% 81% AE% E5% AE% 9F% E8% A3% 85 )

Set that is also used in other languages

Although it is another language, Python also has a Set type, which seems to speed up element search compared to the List type. This speedup has the same mechanism and is realized by the hash method.

[What made it explosive just by changing from "in list" to "in set" in Python and why] (https://qiita.com/kitadakyou/items/6f877edd263f097e78f4)

Summary

Now we have a clear idea of how Set # include? Works and why it's so fast. When I usually write code, I think that I don't pay much attention to the contents of classes and methods, but knowing how it works helps me understand the underlying technology, so when I have time, I use Ruby. We recommend that you take a look inside the Rails.

Tomorrow is @ saratani's "A story about how Asana made daily MTG a little easier". looking forward to!

Reference URL

https://docs.ruby-lang.org/ja/latest/class/Set.html https://i.loveruby.net/ja/rhg/book/name.html https://techlife.cookpad.com/entry/2018/12/27/093914 https://qiita.com/kitadakyou/items/6f877edd263f097e78f4