A note on NVMe performance measurements. I measured NVMe performance with the following three tools:
1. dd command
2. fio tool
3. vdbench tool
OS: CentOS 7.7
NVMe: Micron 9100 3.2TB
CPU: Xeon Gold 5117M 14-core x 2
Memory: 32GB 2400 x 6
System: 1029U-TN10RT (Supermicro)
Write command (result: 1.7 GB/s):
time dd if=/dev/zero of=/nvme/testfile bs=1024k count=8192 oflag=direct
Read command (result: 2.2 GB/s):
time dd if=/nvme/testfile of=/dev/null bs=1024k iflag=direct
[root@localhost Downloads]# time dd if=/dev/zero of=/nvme/testfile bs=1024k count=8192 oflag=direct
8589934592 bytes (8.6 GB) copied, 4.95339 s, 1.7 GB/s
[root@localhost Downloads]# time dd if=/nvme/testfile of=/dev/null bs=1024k iflag=direct
8589934592 bytes (8.6 GB) copied, 3.85802 s, 2.2 GB/s
[root@localhost Downloads]#
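dd issues one I/O at a time at queue depth 1, so the numbers above depend heavily on the block size. As a quick sanity check (the path and count here are only illustrative), you could also try a larger block size and see whether the throughput changes:
time dd if=/dev/zero of=/nvme/testfile bs=4M count=2048 oflag=direct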
On Ubuntu you can install fio with apt-get install fio.
CentOS does not seem to provide it via yum, so I pulled the package from the RIKEN mirror instead. If the libpmem package is missing you may get an error during installation, so install it first.
#yum install libpmem-devel
#wget http://ftp.riken.jp/Linux/fedora/epel/7/x86_64/Packages/f/fio-3.1-1.el7.x86_64.rpm
#rpm -ivh fio-3.1-1.el7.x86_64.rpm
#fio <Test configuration file>
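As a side note, fio can also be run without a job file by passing the parameters on the command line. A minimal sketch (the file path, sizes, and runtime below are just illustrative values) would be:
#fio --name=quicktest --filename=/nvme/ssd.test.file --rw=randread --bs=4k --iodepth=32 --numjobs=4 --size=1g --direct=1 --runtime=30 --group_reporting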
I made sample test configuration files as follows: random access uses 4K blocks with a deep queue depth, and sequential access uses 1MB blocks with a shallow queue depth.
File name: random.fio
[global]
bs=4k
ioengine=libaio
iodepth=32
size=1g
numjobs=16
direct=1
refill_buffers=1
runtime=60
directory=/nvme
group_reporting=1
filename=ssd.test.file
[rand-read]
rw=randread
stonewall
[rand-write]
rw=randwrite
stonewall
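To run this job file and keep the results around for later comparison, something like the following should work (the log file name is arbitrary):
#fio random.fio --output=random.log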
The results below are fast, but the CPU is almost 100% used, so how to keep this under control alongside the actual application may need some thought.
Random read: 419,000 IO/s
Random write: 331,000 IO/s
(output abridged)
rand-read: (groupid=0, jobs=16): err= 0: pid=101145: Tue Apr 21 14:27:33 2020
read: IOPS=419k, BW=1637MiB/s (1716MB/s)(95.9GiB/60001msec)
lat (usec) : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.04%, 750=0.30%
lat (usec) : 1000=8.95%
lat (msec) : 2=90.27%, 4=0.43%, 10=0.01%, 20=0.01%
cpu : usr=4.36%, sys=91.72%, ctx=4582969, majf=0, minf=18859
rand-write: (groupid=1, jobs=16): err= 0: pid=101170: Tue Apr 21 14:27:33 2020
write: IOPS=331k, BW=1294MiB/s (1357MB/s)(75.8GiB/60001msec)
lat (usec) : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.15%, 750=0.35%
lat (usec) : 1000=0.30%
lat (msec) : 2=97.12%, 4=2.04%, 10=0.04%
cpu : usr=5.20%, sys=92.32%, ctx=1623406, majf=0, minf=13469
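If you want to see how the drive behaves when the benchmark is confined to a few cores (the core range below is arbitrary, only an illustration), fio can pin its jobs by adding something like this to the [global] section:
cpus_allowed=0-3
cpus_allowed_policy=shared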
File name: seq.fio
[global]
bs=1024k
ioengine=libaio
iodepth=4
size=1g
numjobs=4
direct=1
refill_buffers=1
runtime=60
directory=/nvme
group_reporting=1
filename=ssd.test.file
[seq-read]
rw=read
stonewall
[seq-write]
rw=write
stonewall
These results are also fast, and they don't use much CPU, so this shouldn't be a problem for streaming or file-server use.
Sequential read: 3081 MB/s
Sequential write: 2235 MB/s
seq-read: (groupid=0, jobs=4): err= 0: pid=101279: Tue Apr 21 14:32:02 2020
read: IOPS=2938, BW=2939MiB/s (3081MB/s)(172GiB/60004msec)
lat (usec) : 1000=0.01%
lat (msec) : 2=0.08%, 4=19.51%, 10=79.91%, 20=0.43%, 50=0.08%
cpu : usr=0.34%, sys=25.11%, ctx=142386, majf=0, minf=13361
seq-write: (groupid=1, jobs=4): err= 0: pid=101291: Tue Apr 21 14:32:02 2020
write: IOPS=2131, BW=2132MiB/s (2235MB/s)(125GiB/60006msec)
lat (usec) : 750=0.01%, 1000=0.05%
lat (msec) : 2=2.48%, 4=15.78%, 10=74.68%, 20=7.00%
cpu : usr=23.27%, sys=10.63%, ctx=71113, majf=0, minf=16003
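As a sanity check, the bandwidth here is simply IOPS times the 1 MiB transfer size: roughly 2938 MiB/s for reads and 2131 MiB/s for writes, which matches the BW column (3081 MB/s and 2235 MB/s).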
What about reading/writing directly to the device instead of going through the file system? For that I use vdbench. Download it here: https://www.oracle.com/downloads/server-storage/vdbench-downloads.html. Java (a JRE) is also required, so download that as well. The latest version is probably fine, but Java-related things are always troublesome, so I stick with one that has a proven track record: jre-7u1-linux-x64.rpm
#rpm -ivh jre-7u1-linux-x64.rpm
#unzip vdbench.zip
#cd vdbench
#./vdbench -f <setting file>.prm
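If you want each run to end up in its own report directory (the directory name below is just an example), vdbench also accepts an -o option:
#./vdbench -f random.prm -o output_random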
The test configuration file random.prm is as follows. seekpct is the percentage of random seeks, so 100 means fully random and 0 means sequential, which is a bit counterintuitive.
compratio=1
*
sd=s1,lun=/dev/nvme0n1,align=4096,openflags=o_direct
*
* rdpct=100 is the random-read workload, rdpct=0 is the random-write workload (run one at a time)
wd=wd1,sd=(s1),xfersize=4KB,seekpct=100,rdpct=100
wd=wd1,sd=(s1),xfersize=4KB,seekpct=100,rdpct=0
*
rd=rd1,wd=wd*,iorate=max,forthreads=32,elapsed=300,interval=1
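For reference, a sequential version of the same single-drive test (still assuming the target is /dev/nvme0n1) would mainly change xfersize and seekpct, roughly like this:
sd=s1,lun=/dev/nvme0n1,align=4096,openflags=o_direct
wd=wd1,sd=(s1),xfersize=1024KB,seekpct=0,rdpct=100
rd=rd1,wd=wd1,iorate=max,forthreads=4,elapsed=300,interval=1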
Random read execution result (workload 1). Average: 219081.86 IO/s
Apr 21, 2020 interval i/o MB/sec bytes read resp read write resp resp queue cpu% cpu%
rate 1024**2 i/o pct time resp resp max stddev depth sys+u sys
18:06:23.050 33 188037.00 734.52 4096 100.00 0.024 0.024 0.000 0.971 0.000 4.4 9.0 3.3
18:06:24.054 34 216224.00 844.63 4096 100.00 0.024 0.024 0.000 1.337 0.000 5.2 10.3 4.3
18:06:25.046 35 217075.00 847.95 4096 100.00 0.024 0.024 0.000 1.878 0.000 5.3 10.1 4.0
18:06:26.054 36 224799.00 878.12 4096 100.00 0.024 0.024 0.000 2.129 0.000 5.5 10.5 4.3
18:06:50.048 avg_2-60 219081.86 855.79 4096 100.00 0.024 0.024 0.000 4.086 0.000 5.2 9.9 3.9
Random write execution result (workload 1). Average: 263525.95 IO/s
Apr 21, 2020 interval i/o MB/sec bytes read resp read write resp resp queue cpu% cpu%
rate 1024**2 i/o pct time resp resp max stddev depth sys+u sys
18:16:46.088 1 155367.00 606.90 4096 0.00 0.033 0.000 0.033 0.196 0.000 5.2 6.6 1.6
18:16:47.012 2 194912.00 761.38 4096 0.00 0.033 0.000 0.033 3.979 0.000 6.3 11.2 3.8
18:16:48.052 3 274287.00 1071.43 4096 0.00 0.041 0.000 0.041 1.344 0.000 11.2 13.2 5.6
18:17:45.061 avg_2-60 263525.95 1029.40 4096 0.00 0.036 0.000 0.036 5.737 0.000 9.6 12.2 4.7
vdbench saves its logs in HTML format in the output directory, which makes them easy to browse. You can also see the latency distribution as a histogram.
min(ms) < max(ms) count %% cum%% '+': Individual%; '+-': Cumulative%
0.000 < 0.020 7,578,813 58.6 58.6 +++++++++++++++++++++++++++++
0.020 < 0.040 4,734,960 36.6 95.3 ++++++++++++++++++-----------------------------
0.040 < 0.060 90,903 0.7 96.0 -----------------------------------------------
0.060 < 0.080 20,341 0.2 96.1 -----------------------------------------------
0.080 < 0.100 15,663 0.1 96.2 ------------------------------------------------
0.100 < 0.200 483,566 3.7 100.0 +------------------------------------------------
How about going all out and measuring with six NVMe drives at once?
The configuration file looks like this.
sd=s1,lun=/dev/nvme2n1,align=4096,openflags=o_direct
sd=s2,lun=/dev/nvme3n1,align=4096,openflags=o_direct
sd=s3,lun=/dev/nvme4n1,align=4096,openflags=o_direct
sd=s4,lun=/dev/nvme5n1,align=4096,openflags=o_direct
sd=s5,lun=/dev/nvme6n1,align=4096,openflags=o_direct
sd=s6,lun=/dev/nvme7n1,align=4096,openflags=o_direct
*
wd=wd1,sd=(s1),xfersize=1024KB,seekpct=0,rdpct=100
wd=wd2,sd=(s2),xfersize=1024KB,seekpct=0,rdpct=100
wd=wd3,sd=(s3),xfersize=1024KB,seekpct=0,rdpct=100
wd=wd4,sd=(s4),xfersize=1024KB,seekpct=0,rdpct=100
wd=wd5,sd=(s5),xfersize=1024KB,seekpct=0,rdpct=100
wd=wd6,sd=(s6),xfersize=1024KB,seekpct=0,rdpct=100
*
rd=rd1,wd=wd*,iorate=max,forthreads=4,elapsed=60,interval=1
The result: 18 GB/s! Exceeding 10 GB/s turns out to be surprisingly easy. It made me feel that a setup like this would handle 4K or even 8K editing nicely.
Apr 21, 2020 interval i/o MB/sec bytes read resp read write resp resp queue cpu% cpu%
rate 1024**2 i/o pct time resp resp max stddev depth sys+u sys
15:38:59.052 55 18509.00 18509.00 1048576 100.00 1.943 1.943 0.000 2.933 0.000 36.0 2.9 2.2
15:39:00.052 56 18592.00 18592.00 1048576 100.00 1.935 1.935 0.000 4.156 0.000 36.0 2.5 2.2
15:39:01.054 57 18552.00 18552.00 1048576 100.00 1.939 1.939 0.000 2.804 0.000 36.0 2.6 2.3
15:39:02.054 58 18590.00 18590.00 1048576 100.00 1.937 1.937 0.000 2.914 0.000 36.0 2.7 2.3
15:39:03.048 59 19280.00 19280.00 1048576 100.00 1.939 1.939 0.000 2.756 0.000 37.4 2.7 2.3
15:39:04.053 60 17825.00 17825.00 1048576 100.00 1.940 1.940 0.000 2.741 0.000 34.6 2.6 2.3
15:39:04.083 avg_2-60 18524.08 18524.08 1048576 100.00 1.941 1.941 0.000 6.164 0.000 36.0 2.7 2.2
Random performance also improves as you add workloads: with random read, over 3,500,000 IO/s! However, be careful, because the CPU goes straight into the red zone. When I pushed it too far, I got a warning.
Apr 21, 2020 interval i/o MB/sec bytes read resp read write resp resp queue cpu% cpu%
rate 1024**2 i/o pct time resp resp max stddev depth sys+u sys
15:32:41.052 55 3571729.00 13952.07 4096 100.00 0.050 0.050 0.000 19.540 0.000 179.5 87.8 61.1
15:32:42.052 56 3602147.00 14070.89 4096 100.00 0.050 0.050 0.000 14.770 0.000 180.5 87.6 61.7
15:32:43.051 57 3623401.00 14153.91 4096 100.00 0.050 0.050 0.000 18.074 0.000 181.1 87.7 61.9
15:32:44.051 58 3591990.00 14031.21 4096 100.00 0.050 0.050 0.000 19.873 0.000 179.6 87.6 61.6
15:32:45.051 59 3739079.00 14605.78 4096 100.00 0.050 0.050 0.000 12.645 0.000 186.6 87.7 61.7
15:32:46.053 60 3453817.00 13491.47 4096 100.00 0.050 0.050 0.000 14.438 0.000 172.7 87.8 61.5
15:32:46.096 avg_2-60 3613343.63 14114.62 4096 100.00 0.050 0.050 0.000 29.592 0.000 180.5 87.7 61.8
15:32:46.097 * Warning: average processor utilization 87.68%
15:32:46.097 * Any processor utilization over 80% could mean that your system
15:32:46.097 * does not have enough cycles to run the highest rate possible
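For reference, a sketch of the random-read version of the six-drive workload (assuming the same sd definitions as above and reusing the forthreads value from the single-drive random test) would look roughly like this:
wd=wd1,sd=(s1),xfersize=4KB,seekpct=100,rdpct=100
wd=wd2,sd=(s2),xfersize=4KB,seekpct=100,rdpct=100
wd=wd3,sd=(s3),xfersize=4KB,seekpct=100,rdpct=100
wd=wd4,sd=(s4),xfersize=4KB,seekpct=100,rdpct=100
wd=wd5,sd=(s5),xfersize=4KB,seekpct=100,rdpct=100
wd=wd6,sd=(s6),xfersize=4KB,seekpct=100,rdpct=100
rd=rd1,wd=wd*,iorate=max,forthreads=32,elapsed=60,interval=1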