[Ruby on Rails] Understand why Set # include? Is so fast

Introduction

This article is the 18th day of Tabelog Advent Calendar 2020.

Hello, this is @an_sony of eating log Web engineer. I usually develop a reservation service system.

Recently, the GoToEat online business has increased the number of users of the reservation service (Thank you for using it!), And the number of performance tunings has increased.

As one of the tuning methods, I often used ** to speed up the element search process of the list by using Set # include? Instead of Array # include? **, but at one point, "that , Array and Set look similar, but why is this process so fast? "

Can you answer this reason?

It would be nice to look at the contents (implementation) of the Set class to clarify, but by the way, I have never seen the contents of Ruby or Rails. .. So I think it's a good opportunity, and I'll try to clarify it by looking at the contents.

↓ People who want to read

・ People who do not know or understand the Set class ・ Ruby, Rails beginner to intermediate

What is the Set class in the first place?

A class that handles "sets". A set is a collection of unique objects. There is no particular ordering relationship between the elements.

You can create a set by passing an array. At this time, the duplicate element is deleted.

[1] pry(main)> s = Set.new([1, 2, 1])
=> #<Set: {1, 2}>

Set # include? Uses Array # include? in the same way.

[2] pry(main)> s.include?(1)
=> true
[3] pry(main)> s.include?(3)
=> false

When do you use it?

Set # include? Is faster than Array # include?, Especially when the number of elements is large. (Measurement comparison is not performed here), so it is very useful when handling a large amount of data such as batch processing.

So why is it so fast?

Looking at the official reference, there was an explanation like this.

It has both the arithmetic function of Array and the high-speed search function of Hash.
Use Hash as internal memory.

It's a class that takes advantage of Array and Hash. And it seems that it is Hash's search function that realizes the speedup.

What it means is that it is realized by the "hash method" which is one of the data search algorithms. The hash method is a method that allows you to search for specific data in one shot by determining the storage position of the data when storing it in memory. The storage position is calculated by a fixed formula (hash function). * To be exact, Ruby uses the chain method. This makes it possible to search for a large amount of data at high speed.

Take a look at the contents

Let's take a look at the contents of the Set class. First of all, this is the content when Set.new is done ↓

# Creates a new set containing the elements of the given enumerable
# object.
#
# If a block is given, the elements of enum are preprocessed by the
# given block.
#
#     Set.new([1, 2])                       #=> #<Set: {1, 2}>
#     Set.new([1, 2, 1])                    #=> #<Set: {1, 2}>
#     Set.new([1, 'c', :s])                 #=> #<Set: {1, "c", :s}>
#     Set.new(1..5)                         #=> #<Set: {1, 2, 3, 4, 5}>
#     Set.new([1, 2, 3]) { |x| x * x }      #=> #<Set: {1, 4, 9}>
def initialize(enum = nil, &block) # :yields: o
  @hash ||= Hash.new(false)

  enum.nil? and return

  if block
    do_with_enum(enum) { |o| add(block[o]) }
  else
    merge(enum)
  end
end

https://github.com/ruby/ruby/blob/master/lib/set.rb#L90-L100

Suddenly Hash.new. If you do Set.new ([1, 2, 1]), you are calling the merge method because there is no block. Let's take a look at the merge method.

# Merges the elements of the given enumerable object to the set and
# returns self.
def merge(enum)
  if enum.instance_of?(self.class)
    @hash.update(enum.instance_variable_get(:@hash))
  else
    do_with_enum(enum) { |o| add(o) }
  end

  self
end

https://github.com/ruby/ruby/blob/master/lib/set.rb#L438-L448

In the case of Set.new ([1, 2, 1]), it seems that you are calling add. What add is doing is

# Adds the given object to the set and returns self.  Use +merge+ to
# add many elements at once.
#
#     Set[1, 2].add(3)                    #=> #<Set: {1, 2, 3}>
#     Set[1, 2].add([3, 4])               #=> #<Set: {1, 2, [3, 4]}>
#     Set[1, 2].add(2)                    #=> #<Set: {1, 2}>
def add(o)
  @hash[o] = true
  self
end
alias << add

https://github.com/ruby/ruby/blob/master/lib/set.rb#L350-L360

The key of @hash is set.

That is, when you run Set.new ([1, 2, 1]) internally

@hash = {1=>true, 2=>true}

It seems that a Hash instance variable is created like this. So that's it~

So what is Set # include? doing? (I can predict it ...)

# Returns true if the set contains the given object.
#
# Note that <code>include?</code> and <code>member?</code> do not test member
# equality using <code>==</code> as do other Enumerables.
#
# See also Enumerable#include?
def include?(o)
  @hash[o]
end
alias member? include?

https://github.com/ruby/ruby/blob/master/lib/set.rb#L242-L251

Yes, it is judged whether the argument o exists in the key of @hash. Well, that's right.

In other words, the contents of Set # include? -Create a Hash type instance variable and set a value in key -Determine whether the value of the argument of include? Exists in key

What you are doing is the same as Hash # include?.

See also the complexity order

What is the computational complexity order? Please refer to here for people ↓ Reference: Computational complexity order

Method Computational complexity order
Set#include?, Hash#Include? O(logn)
Array#Include? O(n)

Set # include? And Hash # include? Are O (logn). Therefore, even if the number of elements increases, the processing does not slow down easily.

On the other hand, Array # include? Since the amount of calculation is O (n), the processing becomes slower as the number of elements increases. Let's see the contents of Array # include? And clarify the reason why it is slow. (It's C language ...)

/*
 *  call-seq:
 *    array.include?(obj) -> true or false
 *
 *  Returns +true+ if for some index +i+ in +self+, <tt>obj == self[i]</tt>;
 *  otherwise +false+:
 *    [0, 1, 2].include?(2) # => true
 *    [0, 1, 2].include?(3) # => false
 */

VALUE
rb_ary_includes(VALUE ary, VALUE item)
{
    long i;
    VALUE e;

    for (i=0; i<RARRAY_LEN(ary); i++) {
  e = RARRAY_AREF(ary, i);
  if (rb_equal(e, item)) {
      return Qtrue;
  }
    }
    return Qfalse;
}

https://github.com/ruby/ruby/blob/master/array.c#L5102-L5125

Only the length of ary is looped by the for statement. Therefore, if the number of ary elements is large, the loop processing will be longer accordingly. That's why it's O (n).

Further investigation revealed

Hash # include? Has also evolved recently

I was also investigating the Hash class, but when Ruby 2.6 introduced a technology called Transient Heap, I found an article that devised a Hash search method.

[Transient Heap device: Hash implementation](https://techlife.cookpad.com/entry/2018/12/27/093914#Transient-Heap-%E3%81%AE%E5%B7%A5%E5% A4% ABHash-% E3% 81% AE% E5% AE% 9F% E8% A3% 85 )

Set that is also used in other languages

Although it is another language, Python also has a Set type, which seems to speed up element search compared to the List type. This speedup has the same mechanism and is realized by the hash method.

[What made it explosive just by changing from "in list" to "in set" in Python and why] (https://qiita.com/kitadakyou/items/6f877edd263f097e78f4)

Summary

Now we have a clear idea of ​​how Set # include? Works and why it's so fast. When I usually write code, I think that I don't pay much attention to the contents of classes and methods, but knowing how it works helps me understand the underlying technology, so when I have time, I use Ruby. We recommend that you take a look inside the Rails.

Tomorrow is @ saratani's "A story about how Asana made daily MTG a little easier". looking forward to!

Reference URL

https://docs.ruby-lang.org/ja/latest/class/Set.html https://i.loveruby.net/ja/rhg/book/name.html https://techlife.cookpad.com/entry/2018/12/27/093914 https://qiita.com/kitadakyou/items/6f877edd263f097e78f4

Recommended Posts

[Ruby on Rails] Understand why Set # include? Is so fast
[Ruby on Rails] What is Bcrypt?
Ruby on Rails Elementary
Ruby on Rails basics
Ruby On Rails Association
Ruby on rails learning record -2020.10.03
Ruby on rails learning record -2020.10.04
[Ruby on Rails] Debug (binding.pry)
Ruby on rails learning record -2020.10.05
Ruby on rails learning record -2020.10.09
Ruby on Rails config configuration
Ruby on Rails basic learning ①
[Ruby on Rails] about has_secure_password
Ruby on rails learning record-2020.10.07 ②
Why Kotlin is so useful
Commentary on partial! --Ruby on Rails
Ruby on rails learning record-2020.10.07 ①
Cancel Ruby on Rails migration
Ruby on rails learning record -2020.10.06
Beginner Ruby on Rails What I learned is being summarized
Ruby on Rails validation summary
Ruby on Rails Basic Memorandum
Understand code coverage with Rspec, the Ruby on Rails test framework
Ruby on Rails Overview (Beginner Summary)
[Ruby on Rails] Read try (: [],: key)
[Ruby on Rails] yarn install --check-files
Ruby on Rails variable, constant summary
[Ruby on Rails] Introduced paging function
Progate Ruby on Rails5 Looking Back
Divide by Ruby! Why is it 0?
How to use Ruby on Rails
Ruby on Rails Japanese-English support i18n
(Ruby on Rails6) "Erase" posted content
[Ruby on Rails] CSV output function
Ruby on Rails 6.0 environment construction memo
[Ruby on Rails] Confirmation page creation
Ruby On Rails devise routing conflict
[Ruby on Rails] Comment function implementation
[Ruby on Rails] DM, chat function
[Ruby on Rails] Convenient helper method
[Ruby on Rails] Stop "looping until ..."
"" Application.js "is not present in the asset pipeline" error in Ruby on Rails
[Ruby On Rails] How to search the contents of params using include?
[Ruby on Rails] Introduction of initial data
[Ruby on Rails] Search function (not selected)
[Rails] Addition of Ruby On Rails comment function
[Ruby on Rails] Creating an inquiry form
[Ruby on Rails] View test with RSpec
[Ruby on Rails] How to use CarrierWave
[Ruby on Rails] Code check using Rubocop-airbnb
[Ruby on Rails] 1 model CRUD (Routing Main)
[Technical memo] What is "include" in Ruby?
Ruby on Rails installation method [Mac edition]
[Ruby on Rails] model, controller terminal command
Let's summarize "MVC" of Ruby on Rails
Ruby on Rails model creation / deletion command
[Ruby on Rails] About bundler (for beginners)
part of the syntax of ruby ​​on rails
Ruby on Rails6 Practical Guide cp7 ~ cp9 [Memo]
[Ruby on Rails] Follow function implementation: Bidirectional
Notes on using FCM with Ruby on Rails