[RUBY] A story about reducing memory consumption to 1/100 with find_in_batches

How to handle large amounts of data with reduced memory consumption

Are you writing memory-aware code in Rails?

Rails runs on Ruby, which has **garbage collection** (*1), so you can write code without worrying about releasing memory yourself.

*1 Ruby collects objects that are no longer in use and frees their memory automatically.

As a result, an implementation can eat up memory without anyone noticing, until one day the production server suddenly goes down with a memory error.

I can say this because it actually happened on the project I am working on now.

I was in charge of both the implementation and the fix, and I learned a lot from the experience, so I am writing it down before I forget.

Investigating the cause

First, you have to find out where the memory error occurs.

I used `ObjectSpace.memsize_of_all` to investigate memory usage in Rails (it becomes available after `require 'objspace'`).

This method returns the memory consumed by all living objects, in bytes.

Place it as a checkpoint around the spots where the process is likely to die, and steadily narrow down where the large memory consumption happens.

■ Usage example to check memory usage

require 'objspace'

class Hoge
  def self.hoge
    puts 'Number of object memories before memory expansion by map'
    puts '↓'
    puts ObjectSpace.memsize_of_all     # <==== Checkpoint
    array = ('a'..'z').to_a
    array.map do |item|                 # <==== ①
      puts "Number of object memories in #{item}"
      puts '↓'
      puts ObjectSpace.memsize_of_all   # <==== Checkpoint
      item.upcase
    end
  end
end

■ Execution result

irb(main):001:0> Hoge.hoge
Number of object memories before memory expansion by map
↓
137789340561

Number of object memories in a
↓
137789342473

Number of object memories in b
↓
137789342761

Number of object memories in c
↓
137789343049

Number of object memories in d
↓
137789343337

Number of object memories in e
↓
137789343625

.
.
.

Number of object memories in x
↓
137789349097
Number of object memories in y
↓
137789349385
Number of object memories in z
↓
137789349673
=> ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"]

From this result, you can see that the data passed to map is first expanded in memory all at once, and memory consumption jumps at that point (part ①).

You can also see that memory consumption grows with every pass through the loop (by 288 bytes per iteration in this run).
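
The checkpoint approach can also be wrapped in a small helper that prints the difference instead of the raw totals. The following is a minimal sketch, not the code I used in the investigation. `memsize_delta` is a hypothetical name, and the number it reports varies between runs because garbage collection can happen at any time.

```ruby
require 'objspace'

# Hypothetical helper (not from the original investigation): runs the block
# and returns its result together with how many bytes
# ObjectSpace.memsize_of_all changed by while the block ran.
def memsize_delta
  GC.start                                # settle garbage before the first checkpoint
  before = ObjectSpace.memsize_of_all     # checkpoint before
  result = yield
  after = ObjectSpace.memsize_of_all      # checkpoint after
  [result, after - before]
end

upcased, delta = memsize_delta do
  ('a'..'z').to_a.map(&:upcase)
end

puts "map over #{upcased.size} letters changed live object size by #{delta} bytes"
```

Wrapping a suspicious block this way makes it easier to compare several candidate spots without subtracting the totals by hand.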

With simple processing like this sample, that is not a problem.

But if the amount of data passed in is large and the work done inside the loop is heavy, memory gets squeezed.

**The result is a memory error (an error that occurs when memory cannot keep up with the processing).**

I investigated our case with the same procedure and concluded that the memory error occurred because a large amount of data was passed in and heavy processing that issues queries inside map was run on it.

Countermeasure

Now the cause is clear.

Next, let's think about countermeasures.

The first three measures I came up with were the following.

1. Increase memory with the power of money
2. Parallel processing with Thread
3. Batch processing

1. Increase memory with the power of money

To be honest, this is the fastest option: you only have to raise the server's memory spec with the power of money, so let's do this!

Or so I thought.

But no other process is this memory-intensive, and spending money just for this one part seemed foolish, so I dropped the idea.

2. Parallel processing with Thread

The next idea was Ruby's parallel processing. If the bottleneck were processing time (a timeout), this would be the right call: start several threads, compute in parallel, merge the results, and the work finishes sooner. This time, however, the bottleneck is memory pressure. Splitting the work across threads does not change the total amount of data being handled, so I expected the memory error to occur anyway and dropped this idea too.
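
To illustrate why threads do not help here, the sketch below loads everything first and then splits the work across threads. The data, the thread count of 4, and the `upcase` work are all stand-ins made up for illustration. The point is only that the total number of records held in memory stays the same.

```ruby
# Stand-in for "load all the data at once": 10,000 strings built up front.
records = Array.new(10_000) { |i| "record-#{i}" }

thread_count = 4 # arbitrary number chosen for this illustration
slices = records.each_slice(records.size / thread_count).to_a

# Each thread works on its own slice, which may shorten the elapsed time...
threads = slices.map do |slice|
  Thread.new { slice.map(&:upcase) }
end
results = threads.flat_map(&:value)

# ...but all slices together are still the full data set in memory.
held_at_once = slices.sum(&:size)
puts "threads: #{threads.size}, records held at once: #{held_at_once}"
```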

3. Batch processing

The main cause of this memory error is that a large amount of data is expanded in memory at once and high-load processing is then repeated over it in a loop.

So I thought that large amounts of data could be handled with less memory by processing them in batches of 1,000 instead of expanding everything in memory at once.

Rails has a method called find_in_batches, which processes records in batches of 1,000 by default.

Example) For 10,000 records, split them into groups of 1,000 and run 10 batches.
The idea is that limiting how much is processed at once with find_in_batches reduces memory use.
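
The split described in the example can be sketched in plain Ruby with `each_slice`. This only illustrates the arithmetic, assuming a Range of 10,000 integers as a stand-in for record IDs. It does not touch a database.

```ruby
ids = (1..10_000)   # stand-in for 10,000 record IDs (a Range, not a loaded Array)
batch_size = 1_000

batch_sizes = []
ids.each_slice(batch_size) do |batch|
  # Only this batch (an Array of up to 1,000 IDs) is materialized at this point.
  batch_sizes << batch.size
end

puts "#{batch_sizes.size} batches, largest batch: #{batch_sizes.max} items"
```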

Conclusion

**Batch processing using find_in_batches**

Implementation

Once you know what to do, all that is left is to implement it.

Let's implement it. (I cannot show the actual company code, so this is only an illustration.)

■ Implementation sketch

User.find_in_batches(batch_size: 1000) do |users|
  # users is an Array of up to 1,000 User records; process them here
end

Even when 10,000 User records are fetched, find_in_batches processes them 1,000 at a time.

In other words, the work is split into 10,000 / 1,000 = 10 batches.
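
As far as I understand, find_in_batches keeps memory low by fetching each batch with its own query, ordered by primary key and continuing from the last ID it saw. The sketch below imitates that loop in plain Ruby so it can run without a database. `FakeUserTable` and `fetch_after` are hypothetical names invented for this illustration and are not part of Rails.

```ruby
# Hypothetical in-memory stand-in for a users table; this is not ActiveRecord.
class FakeUserTable
  attr_reader :query_count

  def initialize(ids)
    @ids = ids.sort
    @query_count = 0
  end

  # Imitates: SELECT ... WHERE id > last_id ORDER BY id LIMIT limit
  def fetch_after(last_id, limit)
    @query_count += 1
    @ids.select { |id| id > last_id }.first(limit)
  end

  # Imitates the batching loop: one bounded query per batch,
  # so only one batch is held at a time.
  def find_in_batches(batch_size: 1000)
    last_id = 0
    loop do
      batch = fetch_after(last_id, batch_size)
      break if batch.empty?

      yield batch
      break if batch.size < batch_size # a short batch means nothing is left

      last_id = batch.last
    end
  end
end

table = FakeUserTable.new((1..10_000).to_a)
sizes = []
table.find_in_batches(batch_size: 1000) { |users| sizes << users.size }

puts "batches: #{sizes.size}, largest batch: #{sizes.max}"
```

In this imitation the block never sees more than 1,000 items at once, which is where the memory saving comes from.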

Result

Memory consumption was reduced to 1/100.

Ideas for improvement

**However, the biggest drawback of this implementation is that processing takes too long.**

If you are on Heroku or a similar platform, this implementation will run into a **request timeout error** (*1).

*1 On Heroku, a request that takes 30 seconds or more results in a request timeout error.

So I think it is better to move this heavy processing into a background job.

With Rails, you can do this using Sidekiq.

I recommend working in the following order.

STEP 1. Use find_in_batches to reduce memory consumption.

STEP 2. Once STEP 1 is done, the process should run without a memory error, although it will take some time.
Because it takes time, move the process to the background.

Summary

At first, I thought this was an annoying task.

But I learned a lot, and now I'm glad I implemented it.

Reference

https://techblog.lclco.com/entry/2019/07/31/180000
https://qiita.com/kinushu/items/a2ec4078410284b9856d
