Re: Fast promotion failure

From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Kyotaro HORIGUCHI'" <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: <masao(dot)fujii(at)gmail(dot)com>, <hlinnakangas(at)vmware(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast promotion failure
Date: 2013-05-13 03:07:27
Message-ID: 006801ce4f87$005569e0$01003da0$@kapila@huawei.com
Lists: pgsql-hackers

On Monday, May 13, 2013 5:54 AM Kyotaro HORIGUCHI wrote:
> 2013/05/10 20:01 "Amit Kapila" <amit(dot)kapila(at)huawei(dot)com>:
> > > > C 2013-05-10 15:32:32.170 JST 9242 FATAL: could not receive data from WAL stream:
> >
> > Is there any chance that a network glitch caused this one-time
> > error?
>
> Unix domain sockets are hardly likely to have such troubles. This
> test ran within a single host.
>
> > > I'm getting confused; the patch seems to me to ensure that the
> > > "first checkpoint after fast promotion is performed" uses the
> > > "correct, new, ThisTimeLineID".
> >
> > What is your confusion?
>
> Heikki said in the first message in this thread that he suspected
> the cause of the failure he had seen to be the wrong TLI on which
> the checkpointer runs. Nevertheless, the patch you suggested to me
> looks like it fixes that. Moreover, (one of?) the failures from the
> same cause looks fixed with the patch.

There were 2 problems:
1. There was an issue in the walsender logic due to which, in some cases
after promotion, it hits an assertion or an error.
2. During fast promotion, the checkpoint gets created with the wrong TLI.

Heikki has provided 2 different patches:
fix-standby-promotion-assert-fail-2.patch and
fast-promotion-quick-fix.patch.
Of the two, he has already committed fix-standby-promotion-assert-fail-2.patch
(http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ffa66f4975c99e52984f7ee81b47d137b5b4751)

> Is the point of this discussion that the patch may leave out some
> glitch about the timing of timeline-related changes, and Heikki saw
> an egress of that?

AFAIU, the committed patch leaves a gap in the overall scenario, which is
the fast promotion issue.

With Regards,
Amit Kapila.
