Re: backup manifests and contemporaneous buildfarm failures

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Suraj Kharage <suraj(dot)kharage(at)enterprisedb(dot)com>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Tels <nospam-pg-abuse(at)bloodgate(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: backup manifests and contemporaneous buildfarm failures
Date: 2020-04-04 13:20:51
Message-ID: CA+TgmoZORBcBvvGrQnyA4dfM-Pcy0nPmTzKO-hEGFCKjpcEuWA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Apr 3, 2020 at 11:06 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2020-04-03 20:48:09 -0400, Robert Haas wrote:
> > 'serinus' is also failing. This is less obviously related:
>
> Hm. Tests passed once since then.

Yeah, but conchuela also failed once in what I think was a similar
way. I suspect the fix I pushed last night
(3e0d80fd8d3dd4f999e0d3aa3e591f480d8ad1fd) may have been enough to
clear this up.

> That already seems suspicious. I checked the following (successful) run
> and I did not see that in the stage's logs.

Yeah, the behavior of the test case doesn't seem to be entirely deterministic.

> I, again, have to say that the amount of stuff that was done as part of
>
> commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920
> Author: Peter Eisentraut <peter_e(at)gmx(dot)net>
> Date: 2017-03-23 08:36:36 -0400
>
> Logical replication support for initial data copy
>
> is insane. Adding support for running sql over replication connections
> and extending CREATE_REPLICATION_SLOT with new options (without even
> mentioning that in the commit message!) as part of a commit described as
> "Logical replication support for initial data copy" shouldn't happen.

I agreed then and still do.

> So I'm a bit confused here. The best approach is probably to try to
> reproduce this by adding an artifical delay into backend shutdown.

I was able to reproduce an assertion failure by starting a
transaction, running a replication command that failed, and then
exiting the backend. 3e0d80fd8d3dd4f999e0d3aa3e591f480d8ad1fd made
that go away. I had wrongly assumed that there was no other way for a
walsender to have a ResourceOwner, and in the face of SQL commands
also being executed by walsenders, that's clearly not true. I'm not
sure *precisely* how that lead to the BF failures, but it was really
clear that it was wrong.

> > (I still really dislike the fact that we have this evil hack allowing
> > one connection to mix and match those sets of commands...)
>
> FWIW, I think the opposite. We should get rid of the difference as much
> as possible.

Well, that's another approach. It's OK to have one system and it's OK
to have two systems, but one and a half is not ideal.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2020-04-04 13:34:52 Re: backup manifests
Previous Message Jürgen Purtz 2020-04-04 12:30:14 Re: Add A Glossary