temp table on commit delete rows performance issue

From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: temp table on commit delete rows performance issue
Date: 2024-07-16 11:47:06
Message-ID: 365ab92f507e4ec6addef4286782fe71@Optiver.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

I'm looking for some input on an issue I've observed. A common pattern
I've seen is using temporary tables to put data in before updating the
real tables. Something roughly like:

On session start:
CREATE TEMP TABLE temp_t1 (...) ON COMMIT DELETE ROWS;

On update:
BEGIN;
COPY temp_t1 FROM STDIN (FORMAT BINARY);
INSERT INTO t1 (SELECT * FROM temp_t1 ...) ON CONFLICT DO UPDATE SET ...;
-- potentially some other operations on temp table to put data into real table t1
COMMIT;

This pattern starts to break down under certain exceptional circumstances of
high concurrency. The "ON COMMIT DELETE ROWS" does a truncate that is
fairly expensive and doesn't work well in high-concurrency scenarios. It's
especially noticeable under following circumstances:
- high max_connections setting
- high number of temp tables per session
- concurrent writers at fairly short intervals
Impact is on both TPS on primary as well as that the WAL replay process
on replica becomes completely overloaded (100% cpu even though not
a lot of WAL is being generated)

A very simple pgbench example that showcases degradation (taken
with max_connections=2000 to clearly show it).

test_truncate.sql

begin;
set client_min_messages=warning;
create temp table if not exists ut0 (a int, b text, c text) on commit delete rows;
commit;

test_no_truncate.sql

begin;
set client_min_messages=warning;
create temp table if not exists ut0 (a int, b text, c text); -- do not do anything on commit
commit;

pgbench -n -f test_truncate.sql -T 10 postgres -c 1 -j 1
pgbench (18devel)
tps = 2693.806376 (without initial connection time)

pgbench -n -f test_truncate.sql -T 10 postgres -c10 -j 10
pgbench (18devel)
tps = 12119.358458 (without initial connection time)

pgbench -n -f test_no_truncate.sql -T 10 postgres -c 1 -j 1
pgbench (18devel)
tps = 22685.877281 (without initial connection time)

pgbench -n -f test_no_truncate.sql -T 10 postgres -c10 -j 10
pgbench (18devel)
tps = 200359.458154 (without initial connection time)

For the test_truncate.sql with 10 threads, WAL replay process is spinning at 100% CPU.

The reason seems to be that for a large part, the 'on commit delete rows'
follows the same code path as a regular truncate.
This means:
* taking an exclusive lock on the temp table (WAL logged)
* invalidating rel cache (WAL logged)

On replica process, these exclusive locks are very expensive to replay,
leading to spending replica most of its cycles trying to get LWLocks on non-fast-path.
On primary, degradation on concurrency I believes comes from the rel cache invalidation.

There doesn't seem to be a good reason for this behavior, as the replica
has nothing to do for a temporary table truncation, so this lock doesn't
need to be WAL logged.
The generated WAL is just fully this stuff. LOCK+INVALIDATION+COMMIT.
And it's spinning at 100% CPU for trying to replay this.

rmgr: Standby len (rec/tot): 42/ 42, tx: 8154875, lsn: 0/3AFFE270, prev 0/3AFFE218, desc: LOCK xid 8154875 db 5 rel 18188
rmgr: Standby len (rec/tot): 42/ 42, tx: 8154875, lsn: 0/3AFFE2A0, prev 0/3AFFE270, desc: LOCK xid 8154875 db 5 rel 18191
rmgr: Standby len (rec/tot): 42/ 42, tx: 8154875, lsn: 0/3AFFE2D0, prev 0/3AFFE2A0, desc: LOCK xid 8154875 db 5 rel 18192
rmgr: Transaction len (rec/tot): 62/ 62, tx: 8154875, lsn: 0/3AFFE300, prev 0/3AFFE2D0, desc: INVALIDATION ; inval msgs: relcache 18191 relcache 18192
rmgr: Transaction len (rec/tot): 82/ 82, tx: 8154875, lsn: 0/3AFFE340, prev 0/3AFFE300, desc: COMMIT 2024-07-16 12:56:42.878147 CEST; inval msgs: relcache 18191 relcache 18192
rmgr: Standby len (rec/tot): 42/ 42, tx: 8154876, lsn: 0/3AFFE398, prev 0/3AFFE340, desc: LOCK xid 8154876 db 5 rel 18188
rmgr: Standby len (rec/tot): 42/ 42, tx: 8154876, lsn: 0/3AFFE3C8, prev 0/3AFFE398, desc: LOCK xid 8154876 db 5 rel 18191
rmgr: Standby len (rec/tot): 42/ 42, tx: 8154876, lsn: 0/3AFFE3F8, prev 0/3AFFE3C8, desc: LOCK xid 8154876 db 5 rel 18192
rmgr: Transaction len (rec/tot): 62/ 62, tx: 8154876, lsn: 0/3AFFE428, prev 0/3AFFE3F8, desc: INVALIDATION ; inval msgs: relcache 18191 relcache 18192
rmgr: Transaction len (rec/tot): 82/ 82, tx: 8154876, lsn: 0/3AFFE468, prev 0/3AFFE428, desc: COMMIT 2024-07-16 12:56:42.878544 CEST; inval msgs: relcache 18191 relcache 18192

Have people observed this behavior before? Would the community be
open to accepting a patch that changes the behavior of ON COMMIT ROWS
to not WAL log the lock taken and/or the relcache inval?
I think:
* An exclusive lock needs to be taken on primary, but does not
need to be WAL logged
* Rel cache invalidation should not be necessary I think (it currently
happens just on the TOAST table+index, not on the regular TEMP table)

-Floris

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Yasir Shah 2024-07-16 11:57:37 Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
Previous Message Yasir 2024-07-16 11:46:46 ActiveState Perl is not valid anymore to build PG17 on the Windows 10/11 platforms, So Documentation still suggesting it should be updated