Re: BUG #17846: pg_dump doesn't properly dump with paused WAL replay

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: francisco(dot)reinolds(at)channable(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17846: pg_dump doesn't properly dump with paused WAL replay
Date: 2023-03-16 20:32:46
Message-ID: CAH2-Wzk9=ri-fSEhhgFMpdan1PX_xJtMj-Ln2zrDO=MKwVQeLg@mail.gmail.com
Lists: pgsql-bugs

On Thu, Mar 16, 2023 at 8:11 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I really have no idea what's going on there, but can you show the exact
> pg_dump command(s) being issued? I'm particularly curious whether you
> are using parallel dump. The same for the failing pg_restore.
>
> Also, are all the moving parts (primary server, secondary server,
> pg_dump, pg_restore) exactly the same PG version?

I have heard multiple internal reports of incorrect hint bits being
set on standbys where exported snapshots are used (a capability first
added in 2017, by commit 6c2003f8). These were cases that didn't
involve pg_dump at all, though; they involved a third-party utility
that happens to use exported snapshots to parallelize a process that
synchronizes a remote database system (not a Postgres database) with
the user's Postgres database. This was also a Postgres 13 database,
though I believe we've seen it on a Postgres 11 database too. Both
systems had suspiciously similar symptoms, and both used this utility
that exports snapshots.

I never got to the bottom of the problem despite spending some time on
it. I never personally had the opportunity to directly examine the
incorrectly set hint bits on the standby. However, I am quite
confident that spuriously set hint bits were involved. A coworker had
the opportunity to examine affected pages forensically at one point,
and those pages clearly showed incorrectly set hint bits on affected
standbys. The original user-visible symptom was duplicate entries in
unique indexes on affected standbys, which came and went sporadically.
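
For anyone wanting to look at hint bits on a suspect page, here is a
minimal sketch (not the procedure that was actually used in the cases
above) that decodes the hint-bit flags in a heap tuple's t_infomask.
It assumes the mask values come from something like pageinspect's
heap_page_items(get_raw_page('some_table', 0)), where some_table is a
hypothetical table name. The flag values are the ones defined in
PostgreSQL's htup_details.h; the function name is made up for
illustration.

```python
# Hint-bit flags in t_infomask, as defined in PostgreSQL's htup_details.h.
HEAP_XMIN_COMMITTED = 0x0100
HEAP_XMIN_INVALID = 0x0200
HEAP_XMIN_FROZEN = HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID
HEAP_XMAX_COMMITTED = 0x0400
HEAP_XMAX_INVALID = 0x0800


def decode_hint_bits(t_infomask):
    """Return the names of the hint bits set in a tuple's t_infomask.

    Hypothetical helper for illustration; it only decodes the mask and
    cannot tell whether a given bit was set correctly.
    """
    names = []

    # Both xmin bits set together means "frozen", not committed + invalid.
    xmin_bits = t_infomask & HEAP_XMIN_FROZEN
    if xmin_bits == HEAP_XMIN_FROZEN:
        names.append("XMIN_FROZEN")
    elif xmin_bits == HEAP_XMIN_COMMITTED:
        names.append("XMIN_COMMITTED")
    elif xmin_bits == HEAP_XMIN_INVALID:
        names.append("XMIN_INVALID")

    if t_infomask & HEAP_XMAX_COMMITTED:
        names.append("XMAX_COMMITTED")
    if t_infomask & HEAP_XMAX_INVALID:
        names.append("XMAX_INVALID")

    return names


if __name__ == "__main__":
    # Example mask: xmin known committed, xmax known invalid.
    print(decode_hint_bits(0x0900))
```

Decoding the mask only shows which bits are set. Whether a bit is
"incorrect" still has to be judged against the real commit status of
the tuple's xmin/xmax, and hint bits can legitimately differ between a
primary and its standby.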

This is quite difficult to debug, since all it takes is a full-page
image (FPI) on the primary to "fix" the issue on the affected standby
(actually there are a couple of other things that could do it, like
freezing, but FPIs seem to be most likely). As you can imagine, there
are various practical constraints on accessing affected systems. I
rate the chances of this being due to some undiscovered bug in this
area as high.

--
Peter Geoghegan
