I previously wrote an article along these lines.
At that time I used DBCP2 for the connection pool, because I was migrating from an on-premises Tomcat 8 + MySQL 5.5 environment to a Tomcat 8 + Aurora (MySQL-compatible) cluster on AWS EC2, but I ran into various problems and tuning it was difficult.
This time I replace it with HikariCP (2.7.6) and see what changes.
In a nutshell, DBCP2 is **"just slow"**. In particular, the process of acquiring (borrowing) a connection from the pool is slow, so it is easy to fall into the kind of situation described in that article.
HikariCP is a connection pool that is fast, simple, and reliable. Its introduction page shows how to pull it in with Maven, and since many people probably use it in combination with Hibernate, I will skip the detailed explanation.
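For reference, the Maven coordinates of the version used here are below (groupId/artifactId as published on Maven Central). Note that for the JNDI factory used later to work, the HikariCP jar and its SLF4J dependency also need to be on Tomcat's classpath, e.g. in `$CATALINA_HOME/lib`.

```xml
<!-- HikariCP for Java 8 (the version tested in this article) -->
<dependency>
    <groupId>com.zaxxer</groupId>
    <artifactId>HikariCP</artifactId>
    <version>2.7.6</version>
</dependency>
```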
Here I incorporate it into a Web application (running on Tomcat 8) in driver-based form, using a JNDI DataSource factory.
**DBCP2 connection definition (context.xml):**

```xml
<Context reloadable="true" crossContext="true">
  <Resource
    name="jdbc/[Connection pool name]"
    auth="Container"
    type="javax.sql.DataSource"
    driverClassName="com.mysql.jdbc.Driver"
    factory="org.apache.commons.dbcp2.BasicDataSourceFactory"
    url="jdbc:mysql://[Cluster endpoint]/[DB name]?useUnicode=true&amp;characterEncoding=utf8&amp;connectTimeout=10000&amp;socketTimeout=180000&amp;alwaysSendSetIsolation=false&amp;cacheServerConfiguration=true&amp;elideSetAutoCommits=true&amp;useLocalSessionState=true"
    username="[Connected user name]"
    password="[Connection password]"
    maxTotal="300"
    maxIdle="150"
    minIdle="10"
    initialSize="10"
    testOnBorrow="true"
    validationQuery="select concheck.validation()"
    validationQueryTimeout="10"
    timeBetweenEvictionRunsMillis="60000"
    maxWaitMillis="10000" />
</Context>
```
**HikariCP connection definition (context.xml):**

```xml
<Context reloadable="true" crossContext="true">
  <Resource
    name="jdbc/[Connection pool name]"
    auth="Container"
    type="javax.sql.DataSource"
    driverClassName="com.mysql.jdbc.Driver"
    factory="com.zaxxer.hikari.HikariJNDIFactory"
    jdbcUrl="jdbc:mysql://[Cluster endpoint]/[DB name]?useUnicode=true&amp;characterEncoding=utf8&amp;connectTimeout=10000&amp;socketTimeout=180000&amp;alwaysSendSetIsolation=false&amp;cacheServerConfiguration=true&amp;elideSetAutoCommits=true&amp;useLocalSessionState=true"
    dataSource.implicitCachingEnabled="true"
    dataSource.user="[Connected user name]"
    dataSource.password="[Connection password]"
    minimumIdle="10"
    maximumPoolSize="300"
    connectionTimeout="10000"
    idleTimeout="180000"
    maxLifetime="180000"
    connectionInitSql="select concheck.validation()"
    validationTimeout="9800" />
</Context>
```
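In either case the application sees only a `javax.sql.DataSource` obtained through JNDI, so swapping pools requires no code changes. A minimal lookup sketch (the resource name is the placeholder from the definitions above):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.sql.DataSource;

public class PoolLookup {
    static DataSource lookup() throws NamingException {
        // "java:comp/env" is the JNDI prefix under which Tomcat exposes
        // the <Resource> entries defined in context.xml.
        InitialContext ctx = new InitialContext();
        return (DataSource) ctx.lookup("java:comp/env/jdbc/[Connection pool name]");
    }

    static void ping(DataSource ds) throws SQLException {
        // try-with-resources returns the connection to the pool promptly,
        // which matters for the borrow-wait behavior measured below.
        try (Connection conn = ds.getConnection();
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1));
            }
        }
    }
}
```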
For the validation query used for failover support, `connectionTestQuery` is deprecated (see the article above for details), so I wrote it in `connectionInitSql` instead. Even when placed there, if the query returns an error (= the pool has connected to a Reader instance while in read/write mode), HikariCP closes the connection and reconnects.
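The body of `concheck.validation()` is not shown here, but the idea is a query that fails when it runs on a Reader. On Aurora (MySQL-compatible), one way to express the same check, sketched hypothetically on the application side, is to look at `@@innodb_read_only`, which is 1 on replica instances:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public final class WriterCheck {
    private WriterCheck() {}

    // Hypothetical equivalent of the validation query: throw if this
    // connection points at an Aurora Reader, so the pool discards it and
    // opens a fresh one (which should then reach the new Writer).
    static void assertWriter(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT @@innodb_read_only")) {
            if (rs.next() && rs.getInt(1) == 1) {
                throw new SQLException("Connected to a Reader instance");
            }
        }
    }
}
```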
Using JMeter, I sent 3,000 requests per minute (750 threads) to the Web application for 45 minutes, and out of the various screen requests I extracted three whose standalone ("raw") processing times are short, medium, and long.
The unit is milliseconds. Note that because the amount of data varies greatly per thread (user), the SQL processing time of the "long" screen in particular fluctuates even standalone (up to about 8.5 seconds). The other conditions are as follows:

- 4 Web servers (EC2 r4.large / Amazon Linux 2017.09 / 2 per AZ / Tomcat 8.0.47)
- Aurora: db.r4.2xlarge as the main DB, 2 clusters with one Writer and one Reader each, Ver. 1.16
- Results aggregated by JMeter (2 JMeter clients)
**DBCP2:**

| Request (screen) | Average | Median | 90th percentile | Minimum | Maximum |
|---|---|---|---|---|---|
| Short-processing screen | 60 | 49 | 78 | 24 | 22,832 |
| Medium-processing screen | 234 | 177 | 332 | 86 | 21,607 |
| Long-processing screen | 728 | 171 | 1,827 | 45 | 24,557 |
**HikariCP:**

| Request (screen) | Average | Median | 90th percentile | Minimum | Maximum |
|---|---|---|---|---|---|
| Short-processing screen | 46 | 45 | 57 | 22 | 347 |
| Medium-processing screen | 172 | 158 | 242 | 83 | 3,087 |
| Long-processing screen | 676 | 146 | 1,703 | 41 | 9,524 |
HikariCP is faster across the board, but **what stands out is the maximums**.
With DBCP2, the maximums for short, medium, and long are all in the 20-second range. This is not because the DB server (Aurora) slows down when SQL is executed in parallel; it indicates that threads spend time waiting, affected by other threads' processing, to acquire (borrow) a connection from the pool.
With HikariCP, on the other hand, at this level of request volume responses come back within roughly **"raw SQL processing time + α"**.
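This borrow wait also has an explicit failure mode: a thread that cannot obtain a connection within `connectionTimeout` (HikariCP) or `maxWaitMillis` (DBCP2) receives an exception instead of waiting forever. A minimal sketch of handling that on the HikariCP side, which throws `SQLTransientConnectionException` on a borrow timeout:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.SQLTransientConnectionException;
import javax.sql.DataSource;

public class BorrowTimeout {
    // Runs one unit of work, distinguishing "pool saturated" from other errors.
    static void runQuery(DataSource ds) {
        try (Connection conn = ds.getConnection()) {
            // ... execute SQL here ...
        } catch (SQLTransientConnectionException e) {
            // HikariCP: no connection became free within connectionTimeout
            // (10,000 ms in the definition above) - the pool is saturated.
            System.err.println("Borrow timed out: " + e.getMessage());
        } catch (SQLException e) {
            System.err.println("Query failed: " + e.getMessage());
        }
    }
}
```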
With DBCP2, reconnection after a failover tends to get stuck as soon as the application is under heavy load. When I triggered a failover with HikariCP under the load described above, reconnection completed in a short time, as shown in the figure below.
The difference is hardly noticeable when the load is low, but for a system under a certain amount of load, the choice of connection pool changes not only performance but also the headroom in the number of connections (the range over which it scales) and the stability of failover behavior.
In particular, when migrating from on-premises to AWS, or when moving only the DB to Aurora, the communication latency between the SQL-issuing client (the Web server) and the DB grows, so things like **"we never had a problem with DBCP2 before, but as soon as we migrated, the number of connections stopped scaling and everything got stuck"** tend to happen. Careful verification is needed when migrating.