From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | Justin Pryzby <pryzby(at)telsasoft(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: stress test for parallel workers |
Date: | 2019-07-24 05:15:14 |
Message-ID: | 17389.1563945314@sss.pgh.pa.us |
Lists: | pgsql-hackers |
Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> On Wed, Jul 24, 2019 at 10:11 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> In any case, the evidence from the buildfarm is pretty clear that
>> there is *some* connection. We've seen a lot of recent failures
>> involving "postmaster exited during a parallel transaction", while
>> the number of postmaster failures not involving that is epsilon.

> I don't have access to the build farm history in searchable format
> (I'll go and ask for that).

Yeah, it's definitely handy to be able to do SQL searches in the
history. I forget whether Dunstan or Frost is the person to ask
for access, but there's no reason you shouldn't have it.

> Do you have an example to hand? Is this
> failure always happening on Linux?

I dug around a bit further, and while my recollection of a lot of
"postmaster exited during a parallel transaction" failures is accurate,
there is a very strong correlation I'd not noticed: it's just a few
buildfarm critters that are producing those. To wit, I find that
string in these recent failures (checked all runs in the past 3 months):

  sysname  |    branch     |      snapshot
-----------+---------------+---------------------
 lorikeet  | HEAD          | 2019-06-16 20:28:25
 lorikeet  | HEAD          | 2019-07-07 14:58:38
 lorikeet  | HEAD          | 2019-07-02 10:38:08
 lorikeet  | HEAD          | 2019-06-14 14:58:24
 lorikeet  | HEAD          | 2019-07-04 20:28:44
 lorikeet  | HEAD          | 2019-04-30 11:00:49
 lorikeet  | HEAD          | 2019-06-19 20:29:27
 lorikeet  | HEAD          | 2019-05-21 08:28:26
 lorikeet  | REL_11_STABLE | 2019-07-11 08:29:08
 lorikeet  | REL_11_STABLE | 2019-07-09 08:28:41
 lorikeet  | REL_12_STABLE | 2019-07-16 08:28:37
 lorikeet  | REL_12_STABLE | 2019-07-02 21:46:47
 lorikeet  | REL9_6_STABLE | 2019-07-02 20:28:14
 vulpes    | HEAD          | 2019-06-14 09:18:18
 vulpes    | HEAD          | 2019-06-27 09:17:19
 vulpes    | HEAD          | 2019-07-21 09:01:45
 vulpes    | HEAD          | 2019-06-12 09:11:02
 vulpes    | HEAD          | 2019-07-05 08:43:29
 vulpes    | HEAD          | 2019-07-15 08:43:28
 vulpes    | HEAD          | 2019-07-19 09:28:12
 wobbegong | HEAD          | 2019-06-09 20:43:22
 wobbegong | HEAD          | 2019-07-02 21:17:41
 wobbegong | HEAD          | 2019-06-04 21:06:07
 wobbegong | HEAD          | 2019-07-14 20:43:54
 wobbegong | HEAD          | 2019-06-19 21:05:04
 wobbegong | HEAD          | 2019-07-08 20:55:18
 wobbegong | HEAD          | 2019-06-28 21:18:46
 wobbegong | HEAD          | 2019-06-02 20:43:20
 wobbegong | HEAD          | 2019-07-04 21:01:37
 wobbegong | HEAD          | 2019-06-14 21:20:59
 wobbegong | HEAD          | 2019-06-23 21:36:51
 wobbegong | HEAD          | 2019-07-18 21:31:36
(32 rows)

We already knew that lorikeet has its own peculiar stability
problems, and these other two critters run different compilers
on the same Fedora 27 ppc64le platform.

So I think I've got to take back the assertion that we've got
some lurking generic problem. This pattern looks way more
like a platform-specific issue. Overaggressive OOM killer
would fit the facts on vulpes/wobbegong, perhaps, though
it's odd that it only happens on HEAD runs.
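
As a rough cross-check of the listing above, the rows can be tallied per
animal and per branch. This is only a minimal Python sketch, not anything
from the buildfarm tooling: the ROWS literal is copied from the query
output, and the function name tally() is made up for illustration.

```python
from collections import Counter

# Rows copied verbatim from the query result (sysname | branch | snapshot).
ROWS = """\
lorikeet | HEAD | 2019-06-16 20:28:25
lorikeet | HEAD | 2019-07-07 14:58:38
lorikeet | HEAD | 2019-07-02 10:38:08
lorikeet | HEAD | 2019-06-14 14:58:24
lorikeet | HEAD | 2019-07-04 20:28:44
lorikeet | HEAD | 2019-04-30 11:00:49
lorikeet | HEAD | 2019-06-19 20:29:27
lorikeet | HEAD | 2019-05-21 08:28:26
lorikeet | REL_11_STABLE | 2019-07-11 08:29:08
lorikeet | REL_11_STABLE | 2019-07-09 08:28:41
lorikeet | REL_12_STABLE | 2019-07-16 08:28:37
lorikeet | REL_12_STABLE | 2019-07-02 21:46:47
lorikeet | REL9_6_STABLE | 2019-07-02 20:28:14
vulpes | HEAD | 2019-06-14 09:18:18
vulpes | HEAD | 2019-06-27 09:17:19
vulpes | HEAD | 2019-07-21 09:01:45
vulpes | HEAD | 2019-06-12 09:11:02
vulpes | HEAD | 2019-07-05 08:43:29
vulpes | HEAD | 2019-07-15 08:43:28
vulpes | HEAD | 2019-07-19 09:28:12
wobbegong | HEAD | 2019-06-09 20:43:22
wobbegong | HEAD | 2019-07-02 21:17:41
wobbegong | HEAD | 2019-06-04 21:06:07
wobbegong | HEAD | 2019-07-14 20:43:54
wobbegong | HEAD | 2019-06-19 21:05:04
wobbegong | HEAD | 2019-07-08 20:55:18
wobbegong | HEAD | 2019-06-28 21:18:46
wobbegong | HEAD | 2019-06-02 20:43:20
wobbegong | HEAD | 2019-07-04 21:01:37
wobbegong | HEAD | 2019-06-14 21:20:59
wobbegong | HEAD | 2019-06-23 21:36:51
wobbegong | HEAD | 2019-07-18 21:31:36
"""


def tally(rows_text):
    """Count failures per animal and per (animal, branch) pair.

    Hypothetical helper for illustration; expects 'a | b | c' lines.
    """
    per_animal = Counter()
    per_animal_branch = Counter()
    for line in rows_text.strip().splitlines():
        sysname, branch, _snapshot = (f.strip() for f in line.split("|"))
        per_animal[sysname] += 1
        per_animal_branch[(sysname, branch)] += 1
    return per_animal, per_animal_branch


per_animal, per_animal_branch = tally(ROWS)

# Print one line per animal, then one line per (animal, branch) pair.
for name, count in sorted(per_animal.items()):
    print(name, count)
for (name, branch), count in sorted(per_animal_branch.items()):
    print(name, branch, count)
```

The tally comes to 13 rows for lorikeet, 7 for vulpes, and 12 for
wobbegong, and every vulpes and wobbegong row is a HEAD run; only
lorikeet shows up on the back branches.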
regards, tom lane