[Java] How the primary key generation strategy affects batch processing performance in JPA

Overview

While investigating the performance of batch processing with JPA (Hibernate), I came across a statement on the Batch processing page of the Hibernate documentation saying that when GenerationType.IDENTITY is used as the primary key generation strategy, batching is disabled at the JDBC level (the relevant part is quoted below). I therefore measured how much difference the choice of generation strategy actually makes.

> Chapter 15. Batch processing
> Hibernate disables insert batching at the JDBC level transparently if you use an identity identifier generator.

As a reference record, I also measured the same batch insert performed with plain JDBC, without JPA, using Spring's JdbcTemplate.

Environment

Reference

Method of verification

Performance was verified by measuring the execution time of JUnit unit tests. The test code was written in a Spring Boot and Spring Data JPA environment, and local PostgreSQL and MySQL instances were used as the databases. (Note that this verification is not about which database is faster.)

Settings and verification code

Data source settings in application.yml

** For PostgreSQL **

application.yml


spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/sample_db
    username: test_user
    password: test_user
    driverClassName: org.postgresql.Driver

** For MySQL **

application.yml


spring:
  datasource:
    url: jdbc:mysql://localhost:3306/sample_db
    username: test_user
    password: test_user
    driverClassName: com.mysql.jdbc.Driver

DataSource configuration class

@Configuration
@EnableTransactionManagement
@EnableJpaRepositories(
    basePackages = {"com.example.domain.repository"},
    entityManagerFactoryRef = "entityManagerFactory",
    transactionManagerRef = "transactionManager"
)
public class DataSourceConfigure {

    @ConfigurationProperties(prefix = "spring.datasource")
    @Bean
    public DataSource datasource() {
        DataSource dataSource = DataSourceBuilder.create().build();
        return dataSource;
    }

    @Bean
    public LocalContainerEntityManagerFactoryBean entityManagerFactory(
            EntityManagerFactoryBuilder builder) {
        LocalContainerEntityManagerFactoryBean factory = builder
                .dataSource(datasource())
                .persistenceUnit("default")
                .packages("com.example.domain.entity")
                .build();
        return factory;
    }

    @Bean
    public PlatformTransactionManager transactionManager(
            EntityManagerFactory entityManagerFactory) {
        JpaTransactionManager tm = new JpaTransactionManager();
        tm.setEntityManagerFactory(entityManagerFactory);
        tm.setDefaultTimeout(300);
        tm.afterPropertiesSet();
        return tm;
    }
}

Unit test code

First, 100,000 dummy records were generated and used for the insert processing, and the time from the start of the test to normal completion was measured. JUnit's Stopwatch rule was used for the time measurement.

Test code that uses EntityManager

@RunWith(SpringRunner.class)
@SpringBootTest
@ContextConfiguration(classes = {
        TestApplication.class, DataSourceConfigure.class})
public class MemoWithEntityManagerTest {

    @Rule
    public final Stopwatch stopwatch = StopwatchRule.create();

    @PersistenceContext
    private EntityManager entityManager;
   
    @Value("${spring.jpa.properties.hibernate.jdbc.batch_size:30}")
    private int batchSize;

    private List<Memo> lists;

    @Before
    public void setup() {
        lists = new ArrayList<>();
        for (int i=0; i<100_000; i++) {
            String title = UUID.randomUUID().toString();
            String description = i + " : " + title;
            lists.add(Memo.of(title, description));
        }
    }

    @Test
    @Transactional
    public void testPersist() {
        for (int i=0; i<lists.size(); i++) {
            entityManager.persist(lists.get(i));
            if ((i + 1) % batchSize == 0) { // flush and clear after every batchSize entities
                entityManager.flush();
                entityManager.clear();
            }
        }
        entityManager.flush();
        entityManager.clear();
    }

}

Test code that uses JdbcTemplate

@RunWith(SpringRunner.class)
@SpringBootTest
@ContextConfiguration(classes = {
        TestApplication.class, DataSourceConfigure.class})
public class MemoWithJdbcTemplateTest {

    @Rule
    public final Stopwatch stopwatch = StopwatchRule.create();

    @Autowired
    private JdbcTemplate jdbcTemplate;

    private List<Memo> lists;

    @Before
    public void setup() {
        lists = new ArrayList<>();
        for (int i=0; i<100_000; i++) {
            String title = UUID.randomUUID().toString();
            String description = i + " : " + title;
            lists.add(Memo.of(title, description));
        }
    }

    @Test
    @Transactional
    public void testBatchUpdate() {

        String sql = "insert into memo (title, description, done, updated) values (?, ?, ?, ?)";

        BatchPreparedStatementSetter setter = new BatchPreparedStatementSetter() {
            @Override
            public void setValues(PreparedStatement ps, int i) throws SQLException {
                Memo memo = lists.get(i);
                ps.setString(1, memo.getTitle());
                ps.setString(2, memo.getDescription());
                ps.setBoolean(3, memo.getDone());
                ps.setTimestamp(4, new java.sql.Timestamp(memo.getUpdated().getTime()));
            }
            @Override
            public int getBatchSize() {
                return lists.size();
            }
        };

        jdbcTemplate.batchUpdate(sql, setter);
    }

}

Stopwatch rule for time measurement

public class StopwatchRule {

    public static Stopwatch create() {
        return new Stopwatch() {
            @Override
            public long runtime(TimeUnit unit) {
                return super.runtime(unit);
            }
            @Override
            protected void succeeded(long nanos, Description description) {
                System.out.println(description.getMethodName() + " succeeded, time taken " + toSeconds(nanos));
            }
            @Override
            protected void failed(long nanos, Throwable e, Description description) {
                super.failed(nanos, e, description);
            }
            @Override
            protected void skipped(long nanos, AssumptionViolatedException e, Description description) {
                super.skipped(nanos, e, description);
            }
            @Override
            protected void finished(long nanos, Description description) {
                System.out.println(description.getMethodName() + " finished, time taken " + toSeconds(nanos));
            }
            private double toSeconds(long nanos) {
                return (double) nanos / 1000000000.0;
            }
        };
    }

}

Measurement pattern 1) PostgreSQL + GenerationType.SEQUENCE

In this pattern, inserts are executed as JDBC batches when a batch size is specified.

Table definition

A sequence is used to generate the primary key.

CREATE TABLE memo (
  id bigint NOT NULL,
  title varchar(255) NOT NULL,
  description text NOT NULL,
  done boolean DEFAULT false NOT NULL,
  updated timestamp without time zone DEFAULT current_timestamp NOT NULL,
  CONSTRAINT memo_pkey PRIMARY KEY(id)
);

CREATE SEQUENCE memo_id_seq
START WITH 1
INCREMENT BY 1
NO MINVALUE
NO MAXVALUE
CACHE 1
NO CYCLE;

** Entity class **

package com.example.domain.entity;

@Entity
@Table(name="memo")
public class Memo implements Serializable {

    private static final long serialVersionUID = -7888970423872473471L;

    @Id
    @SequenceGenerator(name = "memo_id_seq", sequenceName = "memo_id_seq", allocationSize = 1)
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "memo_id_seq")
    private Long id;

    @Column(name="title", nullable = false)
    private String title;
    @Column(name="description", nullable = false)
    private String description;
    @Column(name="done", nullable = false)
    private Boolean done;
    @Column(name="updated", nullable = false)
    @Temporal(TemporalType.TIMESTAMP)
    private Date updated;

    public static Memo of(String title, String description) {
        Memo memo = new Memo();
        memo.title = title;
        memo.description = description;
        memo.done = false;
        memo.updated = new Date();
        return memo;
    }

    // getter/setter omitted
}

Configuration with no batch size setting

application.yml


spring:
  jpa:
    properties:
      hibernate:
        generate_statistics: true

Configuration with a batch size setting

The batch_size was measured with two values, 30 and 50 (the example below shows 30; 50 is set the same way).

application.yml


spring:
  jpa:
    properties:
      hibernate:
        generate_statistics: true
        jdbc:
          batch_size: 30

Pattern 1) Measurement result

The measurements were taken twice each.

| No | Batch size | Processing time (s), 1st run | Processing time (s), 2nd run | Remarks |
|----|------------|------------------------------|------------------------------|---------|
| 1 | Not set | 43.603787094 | 43.645273886 | flush and clear every 30 records via EntityManager |
| 2 | 30 | 35.651785986 | 35.052646148 | flush and clear every 30 records via EntityManager |
| 3 | 50 | 35.159041777 | 34.690860229 | flush and clear every 50 records via EntityManager |

Whether the inserts were actually executed as JDBC batches was determined from the statistics logged by Hibernate; set the generate_statistics parameter to true to output this information.

** No. 1, first run statistics **

The reason 200,000 JDBC statements are executed for 100,000 inserted records is that each primary key value is obtained with a separate select nextval('memo_id_seq') call.

StatisticalLoggingSessionEventListener : Session Metrics {
    1535888 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    454642612 nanoseconds spent preparing 200000 JDBC statements;
    38436776034 nanoseconds spent executing 200000 JDBC statements;
    0 nanoseconds spent executing 0 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    24964533986 nanoseconds spent executing 3334 flushes (flushing a total of 100000 entities and 0 collections);
    0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}
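As an aside (a sketch only, not part of the measured setup): the number of nextval round-trips can be reduced by having Hibernate pre-allocate blocks of ids with a larger allocationSize. The database sequence's INCREMENT BY must then be changed to match:

```java
// Sketch, not measured in this article: with allocationSize = 50, Hibernate
// fetches one sequence value per 50 entities instead of one per entity.
// The sequence must be created with INCREMENT BY 50 for the ids to stay consistent.
@Id
@SequenceGenerator(name = "memo_id_seq", sequenceName = "memo_id_seq", allocationSize = 50)
@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "memo_id_seq")
private Long id;
```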

** No. 2, first run statistics **

StatisticalLoggingSessionEventListener : Session Metrics {
    2525719 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    289639147 nanoseconds spent preparing 103334 JDBC statements;
    16091539786 nanoseconds spent executing 100000 JDBC statements;
    15787659766 nanoseconds spent executing 3335 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    16543820527 nanoseconds spent executing 3334 flushes (flushing a total of 100000 entities and 0 collections);
    0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}

** No. 3, first run statistics **

StatisticalLoggingSessionEventListener : Session Metrics {
    1430965 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    308382598 nanoseconds spent preparing 102000 JDBC statements;
    16036841752 nanoseconds spent executing 100000 JDBC statements;
    15643983778 nanoseconds spent executing 2001 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    16309985446 nanoseconds spent executing 2000 flushes (flushing a total of 100000 entities and 0 collections);
    0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}

Measurement pattern 2) PostgreSQL + GenerationType.IDENTITY

This pattern is not executed as JDBC batches even if a batch size is specified: with GenerationType.IDENTITY, Hibernate must execute each INSERT immediately to obtain the generated key, so it cannot queue the statements into a batch. To generate the primary key with GenerationType.IDENTITY, bigserial is used as the table's primary key type.

Table definition

The bigserial type is used to generate the primary key.

CREATE TABLE memo (
  id bigserial NOT NULL,
  title varchar(255) NOT NULL,
  description text NOT NULL,
  done boolean DEFAULT false NOT NULL,
  updated timestamp without time zone DEFAULT current_timestamp NOT NULL,
  CONSTRAINT memo_pkey PRIMARY KEY(id)
);

** Entity class **

package com.example.domain.entity;

@Entity
@Table(name="memo")
public class Memo implements Serializable {

    private static final long serialVersionUID = -7888970423872473471L;

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name="title", nullable = false)
    private String title;
    @Column(name="description", nullable = false)
    private String description;
    @Column(name="done", nullable = false)
    private Boolean done;
    @Column(name="updated", nullable = false)
    @Temporal(TemporalType.TIMESTAMP)
    private Date updated;

    public static Memo of(String title, String description) {
        Memo memo = new Memo();
        memo.title = title;
        memo.description = description;
        memo.done = false;
        memo.updated = new Date();
        return memo;
    }

    // getter/setter omitted
}

Configuration with no batch size setting

The settings are the same as in measurement pattern 1. Since the batch size setting has no effect with GenerationType.IDENTITY, patterns with a batch size set were not measured.

Pattern 2) Measurement result

| No | Batch size | Processing time (s), 1st run | Processing time (s), 2nd run | Remarks |
|----|------------|------------------------------|------------------------------|---------|
| 1 | Not set | 33.262405883 | 33.097327786 | flush and clear every 30 records via EntityManager |
| 2 | Not set | 16.062690625 | 15.846104684 | reference record using JdbcTemplate |

Measurement pattern 3) MySQL + GenerationType.IDENTITY

Like pattern 2, this pattern is not executed as JDBC batches. Since MySQL has no equivalent of PostgreSQL's sequences, GenerationType.SEQUENCE could not be verified.
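As an alternative not measured here: JPA also offers GenerationType.TABLE, which emulates a sequence with a separate table and therefore does not disable JDBC batching, at the cost of extra queries and row locking on the generator table. A minimal sketch, assuming a generator table named hibernate_sequences:

```java
// Sketch only (assumed table and generator names; not part of the measured setup):
// a table-based id generator that works on MySQL and keeps JDBC batching enabled.
@Id
@TableGenerator(name = "memo_table_gen", table = "hibernate_sequences",
        pkColumnName = "sequence_name", valueColumnName = "next_val",
        pkColumnValue = "memo", allocationSize = 50)
@GeneratedValue(strategy = GenerationType.TABLE, generator = "memo_table_gen")
private Long id;
```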

Table definition

CREATE TABLE IF NOT EXISTS memo (
  id BIGINT NOT NULL AUTO_INCREMENT,
  title VARCHAR(255) NOT NULL,
  description TEXT NOT NULL,
  done BOOLEAN NOT NULL DEFAULT FALSE,
  updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
  PRIMARY KEY (id)
) ENGINE = INNODB, CHARACTER SET = utf8mb4, COLLATE utf8mb4_general_ci;

** Entity class **

package com.example.domain.entity;

@Entity
@Table(name="memo")
public class Memo implements Serializable {

    private static final long serialVersionUID = -7888970423872473471L;

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name="title", nullable = false)
    private String title;
    @Column(name="description", nullable = false)
    private String description;
    @Column(name="done", nullable = false)
    private Boolean done;
    @Column(name="updated", nullable = false)
    @Temporal(TemporalType.TIMESTAMP)
    private Date updated;

    public static Memo of(String title, String description) {
        Memo memo = new Memo();
        memo.title = title;
        memo.description = description;
        memo.done = false;
        memo.updated = new Date();
        return memo;
    }

    // getter/setter omitted
}

Configuration with no batch size setting

application.yml


spring:
  jpa:
    properties:
      hibernate:
        generate_statistics: true

Since the batch size setting has no effect with GenerationType.IDENTITY, patterns with a batch size set were not measured.

Pattern 3) Measurement result

| No | Batch size | Processing time (s), 1st run | Processing time (s), 2nd run | Remarks |
|----|------------|------------------------------|------------------------------|---------|
| 1 | Not set | 28.461122403 | 27.934141586 | flush and clear every 30 records via EntityManager |
| 2 | Not set | 16.115392204 | 15.964329785 | reference record using JdbcTemplate |

Measurement pattern 4) MySQL + GenerationType.IDENTITY + rewriteBatchedStatements = true

This pattern adds rewriteBatchedStatements to the conditions of pattern 3. rewriteBatchedStatements is a MySQL-specific JDBC URL parameter, and enabling it can improve batch insert performance.

Table definition

Same as measurement pattern 3

** Entity class **

Same as measurement pattern 3

Configuration with no batch size setting

Same content as measurement pattern 3

rewriteBatchedStatements

Add it as a parameter to the JDBC URL as shown below.

spring:
  datasource:
    url: jdbc:mysql://localhost:3306/sample_db?rewriteBatchedStatements=true
    username: test_user
    password: test_user
    driverClassName: com.mysql.jdbc.Driver

When this parameter is enabled, the insert statement will be rewritten as shown in the example below.

** Example **

Original insert statements

INSERT INTO tbl_name (a,b,c) VALUES (1,2,3);
INSERT INTO tbl_name (a,b,c) VALUES (4,5,6);
INSERT INTO tbl_name (a,b,c) VALUES (7,8,9);

Insert statement after rewriting

INSERT INTO tbl_name (a,b,c) VALUES (1,2,3),(4,5,6),(7,8,9);

Grouping multiple insert statements into one, as in this example, improves run-time performance. There are some restrictions on this method, however; one is that the size of the rewritten SQL statement must not exceed the size specified by max_allowed_packet.
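As a rough illustration of this restriction (the helper class and all byte counts below are hypothetical, not taken from this article's setup), one can estimate whether a rewritten multi-row INSERT stays under max_allowed_packet before choosing a batch size:

```java
// Hypothetical estimate of the size of a rewritten multi-row INSERT statement.
// headerBytes = length of "INSERT INTO memo (...) VALUES ", perRowBytes = average
// length of one "(...)" value group; both numbers are illustrative assumptions.
public class PacketSizeCheck {

    static long estimatedStatementBytes(long headerBytes, long perRowBytes, long rows) {
        return headerBytes + perRowBytes * rows;
    }

    public static void main(String[] args) {
        long maxAllowedPacket = 4L * 1024 * 1024; // a common MySQL default (4 MB); check your server
        long estimate = estimatedStatementBytes(40, 120, 30);
        System.out.println("batch of 30 fits: " + (estimate <= maxAllowedPacket));
    }
}
```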

Pattern 4) Measurement result

This setting does not appear to affect the operation of JPA (Hibernate): there was no performance improvement when using the EntityManager. The JdbcTemplate reference record, on the other hand, became markedly faster.

| No | Batch size | Processing time (s), 1st run | Processing time (s), 2nd run | Remarks |
|----|------------|------------------------------|------------------------------|---------|
| 1 | Not set | 28.258041058 | 28.514153595 | flush and clear every 30 records via EntityManager |
| 2 | Not set | 3.553322863 | 3.49220706 | reference record using JdbcTemplate |

Summary

When the primary key generation strategy is GenerationType.SEQUENCE, setting the batch size was confirmed to improve performance (measurement pattern 1), but my overall impression is that batch processing in JPA takes time. For processing that handles a large amount of data, such as batch jobs, it may be better to use an implementation other than JPA.
