From: | "Jamison, Kirk" <k(dot)jamison(at)jp(dot)fujitsu(dot)com> |
---|---|
To: | 'Adrien Nayrat' <adrien(dot)nayrat(at)anayrat(dot)info> |
Cc: | "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | RE: [PATCH] Speedup truncates of relation forks |
Date: | 2019-07-01 10:55:49 |
Message-ID: | D09B13F772D2274BB348A310EE3027C6502672@g01jpexmbkw24 |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wednesday, June 26, 2019 6:10 PM(GMT+9), Adrien Nayrat wrote:
> As far as I remember, you should see "relation" wait events (type lock) on
> standby server. This is due to startup process acquiring AccessExclusiveLock
> for the truncation and other backend waiting to acquire a lock to read the
> table.
Hi Adrien, thank you for taking time to reply.
I understand that RelationTruncate() can block read-only queries on
standby during redo. However, it's difficult for me to reproduce the
test case where I need to catch that wait for relation lock, because
one has to execute SELECT within the few milliseconds of redoing the
truncation of one table.
Instead, I just measured the whole recovery time, smgr_redo(),
to show the recovery improvement compared to head. Please refer below.
[Recovery Test]
I used the same stored functions and configurations in the previous email
& created "test" db.
$ createdb test
$ psql -d test
1. [Primary] Create 10,000 relations.
test=# SELECT create_tables(10000);
2. [P] Insert one row in each table.
test=# SELECT insert_tables(10000);
3. [P] Delete row of each table.
test=# SELECT delfrom_tables(10000);
4. [Standby] WAL application is stopped at Standby server.
test=# SELECT pg_wal_replay_pause();
5. [P] VACUUM is executed at Primary side, and measure its execution time.
test=# \timing on
test=# VACUUM;
Alternatively, you may use:
$ time psql -d test -c 'VACUUM;'
(Note: WAL has not replayed on standby because it's been paused.)
6. [P] Wait until VACUUM has finished execution. Then, stop primary server.
test=# pg_ctl stop -w
7. [S] Resume WAL replay, then promote standby (failover).
I used a shell script to execute recovery & promote standby server
because it's kinda difficult to measure recovery time. Please refer to the script below.
- "SELECT pg_wal_replay_resume();" is executed and the WAL application is resumed.
- "pg_ctl promote" to promote standby.
- The time difference of "select pg_is_in_recovery();" from "t" to "f" is measured.
shell script:
PGDT=/path_to_storage_directory/
if [ "$1" = "resume" ]; then
psql -c "SELECT pg_wal_replay_resume();" test
date +%Y/%m/%d_%H:%M:%S.%3N
pg_ctl promote -D ${PGDT}
set +x
date +%Y/%m/%d_%H:%M:%S.%3N
while [ 1 ]
do
RS=`psql -Atc "select pg_is_in_recovery();" test`
if [ ${RS} = "f" ]; then
break
fi
done
date +%Y/%m/%d_%H:%M:%S.%3N
set -x
exit 0
fi
[Test Results]
shared_buffers = 24GB
1. HEAD
(wal replay resumed)
2019/07/01_08:48:50.326
server promoted
2019/07/01_08:49:50.482
2019/07/01_09:02:41.051
Recovery Time:
13 min 50.725 s -> Time difference from WAL replay to complete recovery
12 min 50.569 s -> Time difference of "select pg_is_in_recovery();" from "t" to "f"
2. PATCH
(wal replay resumed)
2019/07/01_07:34:26.766
server promoted
2019/07/01_07:34:57.790
2019/07/01_07:34:57.809
Recovery Time:
31.043 s -> Time difference from WAL replay to complete recovery
00.019 s -> Time difference of "select pg_is_in_recovery();" from "t" to "f"
[Conclusion]
The recovery time significantly improved compared to head
from 13 minutes to 30 seconds.
Any thoughts?
I'd really appreciate your comments/feedback about the patch and/or test.
Regards,
Kirk Jamison
From | Date | Subject | |
---|---|---|---|
Next Message | Dean Rasheed | 2019-07-01 11:02:28 | Re: Choosing values for multivariate MCV lists |
Previous Message | Thomas Munro | 2019-07-01 10:50:35 | Re: Optimization of some jsonb functions |