From: | Noah Yetter <nyetter(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)2ndquadrant(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: all_visible replay aborting due to uninitialized pages |
Date: | 2013-11-11 00:40:31 |
Message-ID: | CAPuoA+nYm_DHtBcFsPXkq3wKR3kv9yb3H4VD8JVW1sjH4kBpPg@mail.gmail.com |
Lists: | pgsql-hackers |
Like your customer, this bug has blown up my standby servers, twice in the
last month: the first time all 4 replicas, the second time (mysteriously
but luckily) only 1 of them.
At any rate, since the fix isn't available yet, are there any
configuration changes or maintenance procedures that would prevent, or at
least reduce the probability of, this bug popping up again in the
meantime? I really can't afford to be without my standby servers during
the holidays, even for the few hours it takes to build a new one.
On Tue, May 28, 2013 at 11:58 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Hi,
>
> A customer of ours reported a standby losing sync with the primary due
> to the following error:
> CONTEXT: xlog redo visible: rel 1663/XXX/XXX; blk 173717
> WARNING: page 173717 of relation base/XXX/XXX is uninitialized
> ...
> PANIC: WAL contains references to invalid pages
>
> Looking around, I noticed the following problematic pattern:
> 1) A: wants to do an update, doesn't have enough freespace
> 2) A: extends the relation on the filesystem level
> (RelationGetBufferForTuple)
> 3) A: does PageInit (RelationGetBufferForTuple)
> 4) A: aborts, e.g. due to a serialization failure (heap_update)
>
> At this point the page is initialized in memory, but not WAL-logged. It
> isn't pinned or locked either.
>
> 5) B: vacuum finds that page and sees it's empty, so it marks it
> all-visible. But since the page was never written out (we haven't even
> marked it dirty in step 3), the standby doesn't know it was initialized
> and reports the page as being uninitialized.
>
> ISTM the best backbranchable fix for this is to teach lazy_scan_heap to
> log an FPI for the heap page via visibilitymap_set in that rather
> limited case.
>
> Happy to provide a patch unless somebody has a better idea?
>
> Greetings,
>
> Andres Freund
>
> --
> Andres Freund http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
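
For anyone following the thread, the sequence in steps 1-5 above can be modeled with a toy sketch. This is not PostgreSQL code: the Standby class, the run_scenario function, and the "fpi"/"visible" record kinds are all hypothetical names made up for illustration. It assumes the standby only learns that a page is initialized when it replays a record carrying the page image, and it models the proposed fix as a separate page-image record emitted by vacuum, which may not match how an actual patch would log it.

```python
# Toy model of the failure sequence described above. NOT PostgreSQL code:
# every name here (Standby, run_scenario, record kinds) is hypothetical.

class Standby:
    """Tracks which heap blocks the standby has seen initialized via WAL."""

    def __init__(self):
        self.initialized = set()

    def replay(self, record):
        kind, blk = record
        if kind == "fpi":
            # A full-page image carries the whole page, so replay initializes it.
            self.initialized.add(blk)
        elif kind == "visible":
            # Setting all-visible on a page the standby never saw initialized
            # is the analogue of the PANIC in the report.
            if blk not in self.initialized:
                raise RuntimeError(f"invalid page reference: blk {blk}")


def run_scenario(log_fpi_in_vacuum):
    wal = []
    blk = 0

    # Steps 1-4: session A extends the relation, initializes the page in
    # memory only, then aborts. No WAL record is emitted.
    primary_initialized = {blk}

    # Step 5: vacuum finds the empty, initialized page and marks it all-visible.
    if blk in primary_initialized:
        if log_fpi_in_vacuum:
            wal.append(("fpi", blk))  # modeled fix: ship the page image first
        wal.append(("visible", blk))

    standby = Standby()
    try:
        for record in wal:
            standby.replay(record)
    except RuntimeError:
        return "PANIC"
    return "ok"


print(run_scenario(log_fpi_in_vacuum=False))
print(run_scenario(log_fpi_in_vacuum=True))
```

Without the page image, the toy standby hits the error on the "visible" record; with it, replay completes. The sketch says nothing about which configuration settings would avoid the problem on a real cluster.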