[Ruby] [Rails] find_each is infinite loop and eats up memory in production environment

5 minute read

Convenience method find_each in ActiveRecord I wrote about that story because the implementation was not good and it looped infinitely, and in the production environment there was an event that it was forcibly stopped by the OOM killer.

  • All tables and implementations mentioned in this article are customized for the article :bow:

What is # find_each

What find_each does is that instead of fetching a large amount of data at once and looping, it will fetch and loop by a fixed unit (default 1,000). When handling a large amount of data, a large amount of memory will be used if it is obtained in a batch, but it can be processed with a small amount of memory by using find_each and dividing and obtaining.

It’s hard to understand in words. The following is an execution example.

If there are 10,000 # users
pry(main)> User.all.count
   (1.1ms) SELECT COUNT(*) FROM `users`
=> 10000

If you use # each, you will get 10,000 records in a batch
pry(main)> User.all.each {|user| p user.id}
  User Load (4.5ms) SELECT `users`.* FROM `users`
1
2
3
...
10000

If you use # find_each, you will get 1,000 records each
[8] pry(main)> User.all.find_each {|user| p user.id}
  User Load (3.9ms) SELECT `users`.* FROM `users` ORDER BY `users`.`id` ASC LIMIT 1000
1
2
3
...
1000
  User Load (0.8ms) SELECT `users`.* FROM `users` WHERE `users`.`id` >1000 ORDER BY `users`.`id` ASC LIMIT 1000
1001
1002
1003
...
2000
  User Load (0.8ms) SELECT `users`.* FROM `users` WHERE `users`.`id` >2000 ORDER BY `users`.`id` ASC LIMIT 1000
2001
...
10000

See the Rails Guide for more details. https://railsguides.jp/active_record_querying.html#find-each

find_each is an infinite loop! !!

It is a convenient method find_each, but as I wrote at the beginning, I made a mistake in the implementation and caused an infinite loop. Before explaining the implementation that became an infinite loop, let’s check how find_each works in the first place.

How find_each works

Let’s check how it works by using the first example.

Let’s look at the SQL that is issued first.

SELECT `users`.* FROM `users` ORDER BY `users`.`id` ASC LIMIT 1000

1000 is acquired by specifying limit 1000. What should be noted here is that they are arranged in ascending order of PRIMARY KEY (id).

So how do you get the next 1000?

SELECT `users`.* FROM `users` WHERE `users`.`id` >1000 ORDER BY `users`.`id` ASC LIMIT 1000

It is common to use LIMIT and OFFSET when fetching the next 1000 items using SQL, but OFFSET is not used in this SQL. Instead, you can see that the requirement for users.id >1000 is increasing in the where clause.

1000 of users.id> 1000 is the last id of the first 1000 retrieved. Since the data this time is arranged in ascending order of id, by specifying users.id >1000 which means that data larger than the last id is acquired, the next 1000 items will be acquired without using OFFSET. doing.

Implementation that becomes an infinite loop

Find_each that caused an infinite loop was implemented as follows. What will happen?

Since only # users.id and books.title are used, only the necessary data is acquired by select
Book.joins(:user).select('users.id, books.title').find_each do |book|
  p "user_id: #{book.id}, title: #{book.title}"
end

At first, the following SQL is issued.

SELECT users.id, books.title FROM `books` INNER JOIN `users` ON `users`.`id` = `books`.`user_id` ORDER BY `books`.`id` ASC LIMIT 1000

There is no particular problem with the first SQL. So what happens to the SQL that gets the next 1000 records?

SELECT users.id, books.title FROM `books` INNER JOIN `users` ON `users`.`id` = `books`.`user_id` WHERE `books`.`id` >1000 ORDER BY `books`.` id` ASC LIMIT 1000

A condition such as books.id> 1000 has been added. The condition 1000 is the id of the last 1000 data retrieved first. It is hard to notice that you are only looking at the SQL, but the id acquired in this SQL is not books.id but users.id. Therefore, 1000 which is set to books.id >1000 is specified users.id of the last data.

In this SQL, books.id is in ascending order, and users.id is in no particular order. Therefore, it is possible that the last 1000th data item is books.id: 2000, users.id: 1. In this case, the SQL issued next will be as follows.

SELECT users.id, books.title FROM `books` INNER JOIN `users` ON `users`.`id` = `books`.`user_id` WHERE `books`.`id` >1 ORDER BY `books`.` id` ASC LIMIT 1000

The condition becomes books.id >1, and the data before the SQL one before (books.id >1000) is acquired. By entering users.id whose order is not controlled in the conditions of books.id in this way, the data to be acquired will be mixed up, and in the worst case, the same data will be acquired many times and an infinite loop will occur. I will.

The annoyance of this problem is not always an infinite loop, and depending on the data, books.id >#{last users.id} happens to be specified nicely and completes like that. There is. In that case, it does not cause an error, but it becomes a subtle bug that the data is slightly strange, so it may be better to make it an infinite loop.

How to fix

In the case of the above example, if you do not narrow down the acquisition column with select, books.id will also be acquired, so it will work properly. Even if you narrow down the acquisition column with select, it will work correctly if you also acquire books.id properly as shown below.

Book.joins(:user).select('books.id AS id, users.id AS user_id, books.title').find_each do |book|
  p "user_id: #{book.user_id}, title: #{book.title}"
end

If you correct it as above, it is complete, but I think the problem this time was that there was no automatic test. I had a test that passed the corresponding process, but I did not write a test that makes find_each more than 2 loops. If there is a test, it is likely that you were aware of the bug because it will loop infinitely or the result will be strange. With this as a trigger, I added a test that find_each has 2 or more loops.

Summary

Even if you understand the mechanism of find_each correctly, I think it is difficult to notice the bug this time only by checking on the desk such as code review. In addition, it is rare that more than 1000 cases are processed, and the unit of 1000 cases is just a program convenience, so it was hidden as a latent bug for a while without being noticed by the black box operation test.

Thinking about how I should have noticed this in advance, I thought that there was no choice but to do a test in which find_each had two loops in the white box test. It’s a waste to run the white box test manually for one time, so it’s a good idea to write a proper automatic test so that it can be continuously verified.