Convenient method find_each in ActiveRecord I wrote about the story because the implementation method was not good and it looped infinitely, and the phenomenon that it was forcibly stopped by the OOM killer in the production environment occurred.

All tables and implementations in this article are customized for the article: bow:

What is find_each

What find_each does is not to get a large amount of data at once and loop it, but to get it in a fixed unit (default 1,000) and loop it. When dealing with a large amount of data, if you acquire it all at once, you will use a large amount of memory, but you can process it with a small amount of memory by dividing it using find_each.

It's hard to understand even if you write it in words. The following is an execution example.

#When there are 10,000 users
pry(main)> User.all.count
   (1.1ms)  SELECT COUNT(*) FROM `users`
=> 10000

#If you use each, 10,000 items will be acquired at once.
pry(main)> User.all.each {|user| p user.id}
  User Load (4.5ms)  SELECT `users`.* FROM `users`
1
2
3
...
10000

# find_1 with each,Get 000 items at a time
[8] pry(main)> User.all.find_each {|user| p user.id}
  User Load (3.9ms)  SELECT `users`.* FROM `users` ORDER BY `users`.`id` ASC LIMIT 1000
1
2
3
...
1000
  User Load (0.8ms)  SELECT `users`.* FROM `users` WHERE `users`.`id` > 1000 ORDER BY `users`.`id` ASC LIMIT 1000
1001
1002
1003
...
2000
  User Load (0.8ms)  SELECT `users`.* FROM `users` WHERE `users`.`id` > 2000 ORDER BY `users`.`id` ASC LIMIT 1000
2001
...
10000

See the Rails guide for more details. https://railsguides.jp/active_record_querying.html#find-each

find_each is an infinite loop! !!

It's a convenient method find_each, but as I wrote at the beginning, I made a mistake in the implementation and made an infinite loop. Before explaining the infinite loop implementation, let's see how find_each works in the first place.

How find_each works

Let's check how it works using the execution example shown at the beginning.

Let's take a look at the SQL that is issued first.

SELECT `users`.* FROM `users` ORDER BY `users`.`id` ASC LIMIT 1000

1000 items are acquired by specifying limit 1000. What should be noted here is that they are arranged in ascending order of PRIMARY KEY (id).

So how do you get the next 1000 cases?

SELECT `users`.* FROM `users` WHERE `users`.`id` > 1000 ORDER BY `users`.`id` ASC LIMIT 1000

When getting the next 1000 items using SQL, LIMIT and OFFSET are often used, but OFFSET is not used in this SQL. Instead, you can see that the where clause has an additional requirement of ʻusers.id> 1000`.

ʻUsers.id> 1000 1000 is the last id of the first 1000 obtained. Since the data this time is arranged in ascending order of id, the next 1000 items can be acquired without using OFFSET by specifying ʻusers.id> 1000, which means to acquire data larger than the last id. doing.

Implementation that becomes an infinite loop

Find_each where an infinite loop occurred was implemented as follows. What will happen?

# users.id and books.Since only title is used, only necessary data is acquired by select
Book.joins(:user).select('users.id, books.title').find_each do |book|
  p "user_id: #{book.id}, title: #{book.title}"
end

At first, the following SQL is issued.

SELECT users.id, books.title FROM `books` INNER JOIN `users` ON `users`.`id` = `books`.`user_id` ORDER BY `books`.`id` ASC LIMIT 1000

There is no particular problem with the first SQL. So what about the SQL that gets the next 1000 records?

SELECT users.id, books.title FROM `books` INNER JOIN `users` ON `users`.`id` = `books`.`user_id` WHERE `books`.`id` > 1000 ORDER BY `books`.`id` ASC LIMIT 1000

The condition books.id> 1000 has been added. The condition 1000 is the id of the last 1000 data obtained first. It's hard to notice if you look only at SQL, but the id you get with this SQL is ʻusers.id instead of books.id. Therefore, 1000, which is set to books.id> 1000`, specifies the users.id of the last data.

In this SQL, the order of books.id is ascending, and the order of users.id is not particularly controlled. Therefore, it is possible that the last data of the next 1000th item is books.id: 2000, users.id: 1. In this case, the SQL to be issued next will be as follows.

SELECT users.id, books.title FROM `books` INNER JOIN `users` ON `users`.`id` = `books`.`user_id` WHERE `books`.`id` > 1 ORDER BY `books`.`id` ASC LIMIT 1000

The condition will be books.id> 1, and the data before the previous SQL (books.id> 1000) will be fetched. In this way, if users.id whose order is not controlled is included in the condition of books.id, the data to be acquired will be mixed up, and in the worst case, the same data will be acquired many times and an infinite loop will occur. I will.

The troublesome part of this problem is not always an infinite loop, and depending on the data, books.id> # {last users.id} may happen to be specified nicely and complete like that. There is. In that case, it will not be an error, but it will be a bug that is hard to notice that the data is slightly strange, so it may be better to make it an infinite loop.

How to fix

In the case of the above example, if you do not narrow down the acquisition column with select, books.id will also be acquired, so it will work correctly. Even if you narrow down the acquisition column with select, it will work correctly if you also acquire books.id properly as shown below.

Book.joins(:user).select('books.id AS id, users.id AS user_id, books.title').find_each do |book|
  p "user_id: #{book.user_id}, title: #{book.title}"
end

If you fix it as above, the fix is complete, but I think the problem this time was that there was no automated test. There was a test that passed the corresponding process, but I didn't write a test that loops find_each more than 2 times. If you have a test, you're likely to have noticed the bug as it loops infinitely or gives strange results. With this as a trigger, I also added a test where find_each loops more than 2 times.

Summary

Even if you understand the mechanism of find_each correctly, it is difficult to notice this bug just by checking it on the desk such as code review. In addition, it is a rare process that the number of cases exceeds 1000, and the unit of 1000 is just a matter of program, so it was hidden as a potential bug for a while without being noticed even in a black box operation test.

Thinking about how I should have noticed this in advance, I thought that I had to do a test in which find_each loops 2 in a white box test. It's a waste to run a white-box test manually for one time, so it's a good idea to write an automated test so that it can be verified continuously.