I wasn't involved in the design or the development, I only barely grasped the overall picture by sitting in on reviews, I have no time to read the source now, and the people who built it are not performance-minded. **How do you cope with the grim situation where performance suddenly catches fire, you get summoned, and you are expected to do something within a couple of hours?** That is what I write about here.
Here are Part 1 and Part 2:

- Defense Techniques When You Have to Fight the Performance of Unfamiliar Applications (Part 1) https://qiita.com/nfujita55a/items/3760cd099ca890f5a4d4
- Defense Techniques When You Have to Fight the Performance of Unfamiliar Applications (Part 2) https://qiita.com/nfujita55a/items/555350e61b73db3a2b8c
**This Part 2 is an introduction to tools (using PostgreSQL as the example DBMS).** It describes **what to do after the procedure from Part 1 has identified the offending process as PostgreSQL**.
…… There are probably some odd points, so I would welcome any corrections.
**First, log in to the PostgreSQL server that is having the performance problem and run vmstat.**
For example, on a 4-core server with a single "/" partition, suppose the output looks like this:
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system------ ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 2 1396644 138700 7968 6263976 0 0 35840 4096 2583 579 40 1 56 2 0
1 1 1396644 125812 7964 6249224 0 0 5676 3584 18565 32282 11 9 55 25 0
1 1 1396752 146988 7952 6226776 0 0 5644 4204 26412 47005 12 11 56 21 0
1 1 1396752 139160 7952 6235068 0 0 7168 4096 25600 45379 11 11 55 23 0
1 1 1396752 131488 7964 6242316 0 0 6896 65224 22011 38559 12 10 51 26 0
0 2 1396752 127456 7964 6246276 0 0 2980 4096 9311 14309 20 4 52 24 0
2 1 1396752 120512 7964 6253204 0 0 6444 3080 13579 22878 8 7 61 25 0
0 2 1396852 131960 7964 6241440 0 0 15760 4196 14203 23189 9 7 64 21 0
2 1 1396852 125560 7964 6248624 0 0 6216 3584 11858 19502 8 6 63 24 0
2 2 1396852 137848 7956 6227964 0 0 6516 9180 7363 7245 23 3 50 25 0
1 1 1396852 142536 7944 6197020 0 0 8552 4112 7081 3443 24 4 41 32 0
The wa column (CPU time spent waiting for I/O) stays around 20-30% and the b column (processes blocked on I/O) is almost always non-zero, while plenty of CPU remains idle, so the disk can be judged to be the bottleneck (this is a very common pattern).
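If the sysstat package happens to be installed, iostat is a handy follow-up to confirm which device is actually saturated (a quick sketch; the device names and the exact column set depend on your environment):

$ iostat -x 1
# Watch the per-device columns:
#   %util close to 100%  -> the device is saturated
#   await climbing       -> requests are queuing up in front of the device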
Next, whether or not the disk turns out to be the bottleneck, I want to know why the DBMS is reading and writing to disk, so **I check pg_stat_activity to see "what queries are currently running against PostgreSQL"**.
$ psql -U hoge hogedb
> select * from pg_stat_activity
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query | backend_type
-------+----------+-------+----------+----------+------------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+---------------------+--------+-------------+--------------+----------------------------------------------------------------------------------------------------------------+---------------------
| | 8192 | | | | | | | 2018-11-15 17:50:25.242567+09 | | | | Activity | AutoVacuumMain | | | | | autovacuum launcher
| | 8194 | 10 | postgres | | | | | 2018-11-15 17:50:25.242603+09 | | | | Activity | LogicalLauncherMain | | | | | background worker
16385 | hogedb | 23061 | 16384 | user01 | PostgreSQL JDBC Driver | 10.10.0.1 | | 34892 | 2020-12-25 22:41:15.277531+09 | 2020-12-25 22:41:21.194053+09 | 2020-12-25 22:41:27.858717+09 | 2020-12-25 22:41:27.858717+09 | IO | DataFileRead | active | 8542532 | 8542532 | insert into table1 (c1, c2, c3, c4) values ($1,$2,$3,$4) ON CONFLICT ON CONSTRAINT table1_pkey DO NOTHING | client backend
16385 | hogedb | 22372 | | | | | | | 2020-12-25 22:14:54.574566+09 | 2020-12-25 22:14:54.596588+09 | 2020-12-25 22:14:54.596588+09 | 2020-12-25 22:14:54.596589+09 | | | active | | 8542473 | autovacuum: VACUUM public.table1 | autovacuum worker
16385 | hogedb | 23063 | 16384 | user02 | PostgreSQL JDBC Driver | 10.10.0.2 | | 34896 | 2020-12-25 22:41:24.512955+09 | 2020-12-25 22:41:24.559247+09 | 2020-12-25 22:41:27.858132+09 | 2020-12-25 22:41:27.858132+09 | IO | DataFileRead | active | 8542533 | 8542532 | select c1, c2, c3, c4 from table1 where c2 <= $1 and c3 = $2 order by c1 desc | client backend
16385 | hogedb | 23067 | 16385 | user02 | PostgreSQL JDBC Driver | 10.10.0.2 | | 34825 | 2020-12-25 22:40:24.922965+09 | 2020-12-25 22:40:24.922965+09 | 2020-12-25 22:40:25.120965+09 | 2020-12-25 22:41:27.120965+09 | IO | DataFileRead | active | 8542534 | 8542532 | select c1, c2, c3, c4 from table1 where c2 <= $1 and c3 = $2 order by c1 desc | client backend
16385 | hogedb | 23069 | 16385 | user02 | PostgreSQL JDBC Driver | 10.10.0.2 | | 34814 | 2020-12-25 22:42:31.154757+09 | 2020-12-25 22:42:31.154757+09 | 2020-12-25 22:42:31.545407+09 | 2020-12-25 22:41:27.545407+09 | IO | DataFileRead | active | 8542535 | 8542532 | select c1, c2, c3, c4 from table1 where c2 <= $1 and c3 = $2 order by c1 desc | client backend
16385 | hogedb | 23072 | 16385 | user02 | PostgreSQL JDBC Driver | 10.10.0.2 | | 34925 | 2020-12-25 22:43:12.272667+09 | 2020-12-25 22:43:12.272667+09 | 2020-12-25 22:43:12.213544+09 | 2020-12-25 22:41:27.213544+09 | IO | DataFileRead | active | 8542536 | 8542532 | select c1, c2, c3, c4 from table1 where c2 <= $1 and c3 = $2 order by c1 desc | client backend
16385 | hogedb | 23064 | | | | | | | 2020-12-25 22:41:25.730477+09 | 2020-12-25 22:41:25.752676+09 | 2020-12-25 22:41:25.752676+09 | 2020-12-25 22:41:25.752677+09 | | | active | | 8542532 | autovacuum: VACUUM ANALYZE public.table2 | autovacuum worker
13806 | postgres | 23066 | 10 | postgres | psql | | | -1 | 2020-12-25 22:41:26.642323+09 | 2020-12-25 22:41:27.86806+09 | 2020-12-25 22:41:27.86806+09 | 2020-12-25 22:41:27.868105+09 | | | active | | 8542532 | select * from pg_stat_activity ; | client backend
| | 8190 | | | | | | | 2018-11-15 17:50:25.208832+09 | | | | Activity | BgWriterMain | | | | | background writer
| | 8189 | | | | | | | 2018-11-15 17:50:25.208563+09 | | | | Activity | CheckpointerMain | | | | | checkpointer
| | 8191 | | | | | | | 2018-11-15 17:50:25.209101+09 | | | | Activity | WalWriterMain | | | | | walwriter
What to look at are the queries whose state is active. Run "select * from pg_stat_activity" several times and check whether some SQL is always active. For the SQL that keeps appearing, look at client_addr (the connecting IP) and query_start (when the query started) to pin down the suspicious statements.
Also check wait_event_type and wait_event for the SQL in question. In this example they are IO / DataFileRead, so **you can see that reads from the disk are the bottleneck**. In this way, **look at both the OS and the DBMS so that each side corroborates where the bottleneck is**.
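Rather than eyeballing the full width of pg_stat_activity each time, a narrowed-down query like the following may be easier to repeat (just a sketch built from the columns shown above; adjust it to taste):

select pid,
       now() - query_start as running_for,   -- how long the statement has been running
       wait_event_type,
       wait_event,
       client_addr,
       left(query, 80) as query_head         -- first 80 characters of the SQL text
from pg_stat_activity
where state = 'active'
  and pid <> pg_backend_pid()                -- exclude this psql session itself
order by running_for desc;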
Next comes the EXPLAIN statement. It looks like this: the execution plan of a batch-processing SQL statement (identifiers have been changed).
$ psql -U hoge hogedb
> explain <problem query (fairly long)>
Sort (cost=559316.53..559316.58 rows=19 width=8)
Sort Key: (function1(d.foo_count))
-> HashAggregate (cost=559311.19..559316.13 rows=19 width=8)
-> Result (cost=556681.39..559310.69 rows=100 width=8)
One-Time Filter: ('2020-12-09'::date = '2020-12-09 00:00:00+09'::timestamp with time zone)
-> Nested Loop (cost=556681.39..559285.69 rows=100 width=8)
-> Merge Join (cost=556680.82..558795.11 rows=108 width=21)
Merge Cond: ((l.bar_id = r.bar_id) AND (l.huga_index = r.index))
-> Sort (cost=62831.09..63392.21 rows=224450 width=12)
Sort Key: l.bar_id, l.huga_index
-> Nested Loop (cost=40.44..42881.93 rows=224450 width=12)
-> HashAggregate (cost=40.00..42.00 rows=200 width=4)
-> Seq Scan on t1 (cost=0.00..34.00 rows=2400 width=4)
-> Index Scan using ix_prefix_bazlog9c on prefix_bazlog9 l (cost=0.44..212.71 rows=149 width=16)
Index Cond: ((baz_id = t1.id) AND (bar_id = ANY ('{1234,1235,1236,1237}'::integer[])))
Filter: (piyo_time < '2020-12-19 00:00:00+09'::timestamp with time zone)
-> Sort (cost=493848.72..493993.02 rows=57719 width=25)
Sort Key: r.bar_id, r.index
-> Bitmap Heap Scan on prefix_barlog9 r (cost=386233.76..489284.08 rows=57719 width=25)
Recheck Cond: ((cust_foo_date = '2020-12-09 00:00:00+09'::timestamp with time zone) AND ((customer_send_type)::text = '1'::text) AND (bar_id = ANY ('{1234,1235,1236,1237}'::integer[])))
-> Bitmap Index Scan on ix_prefix_barlog9a (cost=0.00..386219.34 rows=57719 width=0)
Index Cond: ((cust_foo_date = '2020-12-09 00:00:00+09'::timestamp with time zone) AND ((customer_send_type)::text = '1'::text) AND (bar_id = ANY ('{1234,1235,1236,1237}'::integer[])))
-> Index Scan using foo_data_pkey_4 on foo_data d (cost=0.57..4.53 rows=1 width=17)
Index Cond: ((foo_data = '2020-12-09 00:00:00+09'::timestamp with time zone) AND ((foo_type)::text = '1'::text) AND ((baz_id)::text = (r.baz_id)::text))
The EXPLAIN statement does not execute the query itself; it only displays the execution plan. The EXPLAIN ANALYZE statement actually executes the query and displays both how long it took and the execution plan.
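Because EXPLAIN ANALYZE really executes the statement, it is safer to wrap data-modifying SQL in a transaction and roll it back. A sketch reusing the table1 INSERT that appeared in pg_stat_activity above (the column types and literal values here are made-up assumptions):

begin;
explain (analyze, buffers)   -- BUFFERS additionally shows how many blocks were read or hit in cache
  insert into table1 (c1, c2, c3, c4)
  values (1, 2, 3, 4)
  on conflict on constraint table1_pkey do nothing;
rollback;                    -- the insert is undone; only the plan and timings remain in the output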
The execution plan above is fairly long, but when you put real batch-processing SQL through EXPLAIN, **plans of this length are the norm. Read them through anyway.**
I will not explain how to read execution plans here (there are already plenty of write-ups), but there are a few common bad patterns to look for. Incidentally, when the CPU rather than the disk is the bottleneck, vmstat on the same 4-core server looks like this instead (the r column far exceeds the number of cores, us is close to 100%, and wa is almost 0):
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system---- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
23 0 1396644 138700 7968 6263976 0 0 12 96 2583 579 98 0 2 0 0
19 0 1396644 125812 7964 6249224 0 0 245 84 18565 32282 97 0 3 0 0
24 0 1396752 146988 7952 6226776 0 0 644 4 26412 47005 98 0 2 0 0
52 0 1396752 139160 7952 6235068 0 0 168 96 25600 45379 99 0 1 0 0
54 0 1396752 131488 7964 6242316 0 0 896 24 22011 38559 95 3 2 0 0
23 0 1396752 127456 7964 6246276 0 0 980 16 9311 14309 98 0 2 0 0
"No, before pg_stat_activity or EXPLAIN, you should have set up auto_explain properly in the first place." Yes, you're right. If at all possible, set up auto_explain when you build the system; then any query that takes longer than XX seconds is written to the log together with its execution plan.
Reference) Output information of PostgreSQL auto_explain module https://qiita.com/bwtakacy/items/416dda4f3c4eb99cd006
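A minimal postgresql.conf sketch for enabling auto_explain (the threshold value here is illustrative, not a recommendation; tune it to the workload):

# postgresql.conf
shared_preload_libraries = 'auto_explain'   # loading it this way requires a server restart
auto_explain.log_min_duration = '5s'        # log the plan of any statement slower than 5 seconds
auto_explain.log_analyze = off              # 'on' adds actual row counts and timings, at some run-time cost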
**If an index is clearly missing** and the situation is urgent (the site is about to fall over, or a batch job is badly delayed), **create the index on the spot.**
A plain CREATE INDEX makes other queries that try to update the table wait while the index is being built, so instead run
CREATE INDEX CONCURRENTLY sales_quantity_index ON sales_table (quantity)
And ** "CONCURRENTLY" will create an Index without acquiring a lock **. CONCURRENTLY has the risk of taking a long time to create, and possibly failing to create it, but it's much better than ** an exclusive lock on the table for updates in an unfamiliar application **. I will.
Reference) Simultaneous index creation https://www.postgresql.jp/document/12/html/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY
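One caveat: if a CONCURRENTLY build fails or is cancelled, it leaves behind an index marked INVALID, which is not used by queries but is still maintained on every write. A sketch for finding such leftovers in the system catalogs:

-- indexes marked invalid (e.g. from a failed CREATE INDEX CONCURRENTLY)
select n.nspname as schema_name,
       c.relname as index_name
from pg_index i
join pg_class c     on c.oid = i.indexrelid
join pg_namespace n on n.oid = c.relnamespace
where not i.indisvalid;
-- drop the leftover with DROP INDEX CONCURRENTLY, then run the CREATE INDEX CONCURRENTLY again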
There is also what I personally call the "dame-moto ANALYZE" (a nothing-to-lose ANALYZE: try it, and if it doesn't help, nothing is lost).
Sometimes the statistics (the distribution information about the table data that the DBMS keeps) are out of date, and that alone makes the execution plan inefficient.
In PostgreSQL, you can see in pg_stat_all_tables when the statistics for a table were last collected.
$ psql -U hoge hogedb
> \x
Expanded display is on.
> select * from pg_stat_all_tables where relname = 'table name';
-[ RECORD 1 ]-------+------------------------------
relid | 16392
schemaname | public
relname | hogehoge
seq_scan | 772
seq_tup_read | 14177993421
idx_scan | 44891175496
idx_tup_fetch | 31436711971
n_tup_ins | 13813258940
n_tup_upd | 0
n_tup_del | 13776931027
n_tup_hot_upd | 0
n_live_tup | 34048552
n_dead_tup | 1050799
n_mod_since_analyze | 684275
last_vacuum |
last_autovacuum | 2020-12-25 23:33:31.290387+09
last_analyze |
last_autoanalyze | 2020-12-25 23:38:07.036503+09
vacuum_count | 0
autovacuum_count | 4687
analyze_count | 0
autoanalyze_count | 15111
If the last_analyze / last_autoanalyze timestamps are old, or n_mod_since_analyze (rows modified since the last analyze) is large, it is worth first running

analyze verbose <table name>

and seeing whether things improve.
$ psql -U hoge hogedb
> analyze verbose
This form, with no table name, re-collects statistics for every table in the database, but I want to avoid it (exception: when automatic statistics collection has been turned off by mistake).
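To see which tables are most in need of an ANALYZE, the same pg_stat_all_tables view can be scanned across the board (a sketch; the schema filter and the limit are arbitrary choices):

-- tables with the most modifications since their statistics were last collected
select relname,
       n_mod_since_analyze,
       greatest(last_analyze, last_autoanalyze) as last_stats
from pg_stat_all_tables
where schemaname = 'public'
order by n_mod_since_analyze desc
limit 20;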
Sometimes you know the DB is the bottleneck but cannot find anything to improve on the application side. If vmstat or iostat shows that the disk is the bottleneck, improving the disk itself is also an option; that is probably the typical move.
This was the day-25 entry of the Defense Against the Dark Arts Advent Calendar 2020, though it ended up a little late. Thank you for reading to the end.