Re: FSM corruption and standby servers

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Hunley, Douglas" <douglas(dot)hunley(at)openscg(dot)com>, Tim Goodaire <tgoodaire(at)dyn(dot)com>, pgsql-admin <pgsql-admin(at)postgresql(dot)org>
Subject: Re: FSM corruption and standby servers
Date: 2016-10-31 17:27:44
Message-ID: CAKFQuwZkbH9r9bEp6X+9JjE8Q9mcXKDW0sjEJhAX3xBTa-jgGQ@mail.gmail.com
Lists: pgsql-admin

On Mon, Oct 31, 2016 at 9:55 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Hunley, Douglas" <douglas(dot)hunley(at)openscg(dot)com> writes:
> > On Mon, Oct 31, 2016 at 10:38 AM, Tim Goodaire <tgoodaire(at)dyn(dot)com> wrote:
> >> I have a question regarding the FSM corruption bug that is fixed in
> >> postgresql 9.5.5 (https://wiki.postgresql.org/wiki/Free_Space_Map_Problems).
> >> If I don't find any corruption on a master database, is it still possible
> >> that there is corruption on the standbys?
>
> > It shouldn't be, iirc. FSMs are only ever created/updated by vacuum, which
> > doesn't run on a slave until it is promoted to a master.
>
> The problem is that the WAL data can be wrong in these cases, and since
> the standbys only know what they were told in the WAL stream, their images
> will be wrong even if the master is valid.
>
> I would have thought that the referenced page is clear enough about
> needing to check the standbys; do you think it isn't?
>

I can see how the following is a bit loose for someone not super-familiar
with WAL.

"A database crash-and-restart shortly after such an event can lead to
corrupted FSMs. Also, standby servers will receive incorrect WAL data
causing them to create corrupted FSMs locally."

I believe the "shortly" here is present because the crash must occur before
the next checkpoint in order for the problem to appear on the master.
Given this constraint, the secondary emphasis placed on standby servers
seems misplaced. The most probable scenario, given that the bug has
manifested and one is running a standby, is a broken standby and a
functioning master.

"Standby servers are directly impacted by this bug and must be checked for
corruption even if their master appears clean. The master will only
exhibit a problem if a crash-and-restart cycle occurs shortly after the
problem statement (before the next checkpoint), since that causes the
master to replay the just-generated WAL."
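
To make the asymmetry concrete, here is a toy Python model of the reasoning
above. It is only a sketch under my reading of the bug: the record name
"bad_fsm_record" and the functions replay() and simulate() are made up for
illustration and do not correspond to anything in PostgreSQL. It assumes the
master's own FSM pages are correct when written and that only replaying the
affected WAL record produces a bad FSM.

```python
def replay(wal_records):
    """Return True if the FSM is still valid after replaying these records.

    Assumption: any replay of the hypothetical "bad_fsm_record" leaves the
    local FSM corrupted.
    """
    return not any(rec == "bad_fsm_record" for rec in wal_records)


def simulate(crash_before_checkpoint):
    """Return (master_fsm_ok, standby_fsm_ok) for one occurrence of the bug."""
    # WAL the master has generated since its last checkpoint.
    wal_since_checkpoint = ["bad_fsm_record"]

    # The master applied the change directly, so its own FSM starts out fine.
    master_ok = True

    # A standby only knows what the WAL stream tells it, so it always replays.
    standby_ok = replay(wal_since_checkpoint)

    # The master replays this WAL only if it crashes before the next
    # checkpoint; after a checkpoint, recovery starts later in the WAL.
    if crash_before_checkpoint:
        master_ok = replay(wal_since_checkpoint)

    return master_ok, standby_ok


if __name__ == "__main__":
    for crashed in (False, True):
        print(crashed, simulate(crashed))
```

Under these assumptions the standby ends up corrupted in both cases, while
the master is only affected when the crash happens before the checkpoint.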

It is not clear to what extent traditional backups (for example, those
taken with pg_basebackup) are affected...

David J.
