Re: FSM corruption and standby servers

From:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Hunley, Douglas" <douglas(dot)hunley(at)openscg(dot)com>, Tim Goodaire <tgoodaire(at)dyn(dot)com>, pgsql-admin <pgsql-admin(at)postgresql(dot)org>
Subject:	Re: FSM corruption and standby servers
Date:	2016-10-31 17:27:44
Message-ID:	CAKFQuwZkbH9r9bEp6X+9JjE8Q9mcXKDW0sjEJhAX3xBTa-jgGQ@mail.gmail.com
Lists:	pgsql-admin

On Mon, Oct 31, 2016 at 9:55 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Hunley, Douglas" <douglas(dot)hunley(at)openscg(dot)com> writes:
> > On Mon, Oct 31, 2016 at 10:38 AM, Tim Goodaire <tgoodaire(at)dyn(dot)com> wrote:
> >> I have a question regarding the FSM corruption bug that is fixed in
> >> postgresql 9.5.5 (https://wiki.postgresql.org/wiki/Free_Space_Map_Problems).
> >> If I don't find any corruption on a master database, is it still possible
> >> that there is corruption on the standbys?
>
> > It shouldn't be, iirc. FSMs are only ever created/updated by vacuum, which
> > doesn't run on a slave until it is promoted to a master.
>
> The problem is that the WAL data can be wrong in these cases, and since
> the standbys only know what they were told in the WAL stream, their images
> will be wrong even if the master is valid.
>
> I would have thought that the referenced page is clear enough about
> needing to check the standbys; do you think it isn't?
>

I can see how the following is a bit loose for someone not super-familiar
with WAL.

"A database crash-and-restart shortly after such an event can lead to
corrupted FSMs. Also, standby servers will receive incorrect WAL data
causing them to create corrupted FSMs locally."

I believe the "shortly" here is present because the crash must occur before
the next checkpoint in order for the problem to appear on the master.
Given this constraint, the secondary emphasis that standby servers receive
seems misplaced. The most probable scenario - given the bug has
manifested and one is running a standby - is a broken standby and a
functioning master.
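
To make that reasoning concrete, here is a toy model in Python. It is only a sketch of the argument above, not of PostgreSQL internals: the `Node` class, the `"bad_fsm_record"` label, and `run_scenario` are all made-up names, and an FSM is reduced to the string "ok" or "corrupt". It assumes, as described in this thread, that the master's own FSM is fine after the problematic statement and that only the WAL record it emits is wrong.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a server; its FSM is just a label."""
    fsm: str = "ok"

    def replay(self, wal: List[str]) -> None:
        # Replaying a wrong FSM record yields a wrong local FSM image.
        for record in wal:
            if record == "bad_fsm_record":
                self.fsm = "corrupt"


def run_scenario(crash_before_checkpoint: bool) -> Tuple[str, str]:
    """Return (master FSM state, standby FSM state) after the buggy statement."""
    master, standby = Node(), Node()

    # Assumption: the master's own FSM stays valid, but the WAL it
    # generates for the statement is incorrect.
    wal = ["bad_fsm_record"]

    # The standby only knows what the WAL stream tells it, so it
    # always replays the bad record.
    standby.replay(wal)

    # The master replays that WAL only if it crashes and restarts
    # before the next checkpoint; after a checkpoint, recovery no
    # longer starts from before the bad record.
    if crash_before_checkpoint:
        master.replay(wal)

    return master.fsm, standby.fsm


if __name__ == "__main__":
    print("no crash:", run_scenario(crash_before_checkpoint=False))
    print("crash before checkpoint:", run_scenario(crash_before_checkpoint=True))
```

In this model the standby ends up "corrupt" in both cases, while the master is affected only in the crash-before-checkpoint case, which is why a clean master says nothing about its standbys.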

"Standby servers are directly impacted by this bug and must be checked for
corruption even if their master appears clean. The master will only
exhibit a problem if a crash-and-restart cycle occurs shortly after the
problematic statement (before the next checkpoint), causing the master to
replay the just-generated WAL."

It is not clear to what extent traditional backups (such as those taken
with pg_basebackup) are affected...

David J.

In response to

Re: FSM corruption and standby servers at 2016-10-31 16:55:36 from Tom Lane

Responses

Re: FSM corruption and standby servers at 2016-10-31 18:19:24 from Tom Lane
