incorrect wal removal due to max_slot_wal_keep_size

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: incorrect wal removal due to max_slot_wal_keep_size
Date: 2024-10-09 17:29:53
Message-ID: CAMkU=1zvU1HjCighsRu3Xqo4tQBsyWhj0NySsx7D0i6zLsyomA@mail.gmail.com
Lists: pgsql-hackers

I was testing logical replication over my (remarkably bad) wifi network to
see what kind of throughput and lag I would get. I was using pgbench
default transaction as the workload generator with all 4 tables being
replicated. I had synchronous replication configured via
synchronous_standby_names, but it was not actually in use at the time
because synchronous_commit was set to 'local' on the benchmarking
connections.

The master was shut down cleanly with a 'smart shutdown request' when I got
distracted by other things and decided to reboot the Ubuntu machine it was
running on. At that point substantial lag had accumulated--I don't know
exactly how much, but at least 20,000 transactions had replayed after
replication restarted before it stalled.

When I restarted the master PostgreSQL server, the replica started to catch
up, but then eventually stalled.

On the master, I had this log, which occurred right after the first
checkpoint (since the server restart) began:

4790 00000 2024-10-09 12:03:12.819 EDT LOG: invalidating obsolete replication slot "sub"
4790 00000 2024-10-09 12:03:12.819 EDT DETAIL: The slot's restart_lsn 1/84C5B510 exceeds the limit by 37374704 bytes.
4790 00000 2024-10-09 12:03:12.819 EDT HINT: You might need to increase "max_slot_wal_keep_size".
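
For anyone trying to interpret the DETAIL line, here is a rough sketch of the
arithmetic in Python. It assumes that the reported byte count is the distance
from the slot's restart_lsn to the oldest WAL position the server intended to
keep, and that the default 16 MB WAL segment size is in use; neither is
confirmed by the log itself. The helper names parse_lsn and format_lsn are
made up for this illustration.

```python
# Sketch only: decode the LSN from the DETAIL line and derive the implied limit.
# Assumption: "exceeds the limit by N bytes" means limit = restart_lsn + N.

def parse_lsn(text: str) -> int:
    """Convert an 'X/X' LSN (high and low 32-bit halves in hex) to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def format_lsn(value: int) -> str:
    """Convert an integer WAL position back to 'X/X' notation."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"

# Assumption: default WAL segment size (16 MB); a cluster can be built otherwise.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024

restart_lsn = parse_lsn("1/84C5B510")   # from the DETAIL line
overshoot = 37374704                    # bytes, from the DETAIL line
implied_limit = restart_lsn + overshoot

print(format_lsn(implied_limit))                  # 1/87000000
print(implied_limit % WAL_SEGMENT_SIZE == 0)      # True
```

Under those assumptions the implied limit is 1/87000000, which falls exactly
on a segment boundary, so the limit looks like a segment-aligned cutoff
computed at checkpoint time rather than an arbitrary position.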

But max_slot_wal_keep_size was set to -1 and had never been set to anything
other than that!

The master was running 18devel-d94cf5ca7f. Not for any particular reason,
but just because that is what I happened to have on when I started mucking
around with this. I don't recall running this particular test in this
manner before, and have no reason to think it is only broken in 18dev.

I'm going to try to reproduce this on 17.0, but in the meantime any other
suggestions for investigating this?

I have noticed some previous similar complaints about
max_slot_wal_keep_size being incorrectly invoked, but it didn't look like
they were ever resolved.

Cheers,

Jeff
