Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
Date: 2018-03-04 03:07:19
Message-ID: acdb289e-5ce7-c5fa-2d93-295ce073a99b@2ndquadrant.com
Lists: pgsql-hackers

On 03/04/2018 03:40 AM, Andres Freund wrote:
>
>
> On March 3, 2018 6:36:51 PM PST, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> On 03/04/2018 03:20 AM, Thomas Munro wrote:
>>> Hi,
>>>
>>> I saw a one-off failure like this:
>>>
>>>                                 QUERY PLAN
>>> --------------------------------------------------------------------------
>>>   Aggregate (actual rows=1 loops=1)
>>> !   -> Nested Loop (actual rows=98000 loops=1)
>>>           -> Seq Scan on tenk2 (actual rows=10 loops=1)
>>>                 Filter: (thousand = 0)
>>>                 Rows Removed by Filter: 9990
>>> !         -> Gather (actual rows=9800 loops=10)
>>>                 Workers Planned: 4
>>>                 Workers Launched: 4
>>>                 -> Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
>>> --- 485,495 ----
>>>                                 QUERY PLAN
>>> --------------------------------------------------------------------------
>>>   Aggregate (actual rows=1 loops=1)
>>> !   -> Nested Loop (actual rows=97984 loops=1)
>>>           -> Seq Scan on tenk2 (actual rows=10 loops=1)
>>>                 Filter: (thousand = 0)
>>>                 Rows Removed by Filter: 9990
>>> !         -> Gather (actual rows=9798 loops=10)
>>>                 Workers Planned: 4
>>>                 Workers Launched: 4
>>>                 -> Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
>>>
>>>
>>> Two tuples apparently went missing.
>>>
>>> Similar failures on the build farm:
>>>
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=okapi&dt=2018-03-03%2020%3A15%3A01
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-03-03%2018%3A13%3A32
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-03-03%2017%3A55%3A11
>>>
>>> Could this be related to commit
>>> 34db06ef9a1d7f36391c64293bf1e0ce44a33915 or commit
>>> 497171d3e2aaeea3b30d710b4e368645ad07ae43?
>>>
>>
>> I think the same failure (or at least very similar plan diff) was
>> already mentioned here:
>>
>> https://www.postgresql.org/message-id/17385.1520018934@sss.pgh.pa.us
>>
>> So I guess someone else already noticed, but I don't see the cause
>> identified in that thread.
>
> Robert and I started discussing it a bit over IM. No conclusion. Robert tried to reproduce locally, including disabling atomics, without luck.
>
> Can anybody reproduce locally?
>

I've started "make check" with parallel_schedule tweaked to contain many
select_parallel runs, and so far I've seen a couple of failures like
this (about 10 failures out of 1500 runs):

select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and
tenk2.thousand=0;
! ERROR: lost connection to parallel worker

I have no idea why the worker fails (no segfaults in dmesg, nothing in
the postgres log), or if it's related to the issue discussed here at all.
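
For anyone who wants to try the same kind of repeated run, here is a
rough sketch of how a schedule file could be tweaked to contain many
select_parallel runs. It is only an illustration: the helper name
repeat_test_line and the sample schedule lines are hypothetical, and the
mail does not say how the schedule was actually edited. It assumes the
usual schedule format, where each "test:" line lists the tests of one
group, and it repeats only a line that runs select_parallel by itself.

```python
def repeat_test_line(schedule_text: str, test_name: str, copies: int) -> str:
    """Repeat each 'test:' line that runs only `test_name` so it appears `copies` times.

    Hypothetical helper: all other lines (comments, other test groups)
    are passed through unchanged and in their original order.
    """
    out = []
    for line in schedule_text.splitlines():
        out.append(line)
        stripped = line.strip()
        if stripped.startswith("test:"):
            tests = stripped[len("test:"):].split()
            if tests == [test_name]:
                # The line is already present once; add the extra copies.
                out.extend([line] * (copies - 1))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Made-up three-line schedule, not the real parallel_schedule contents.
    sample = "test: some_setup\ntest: select_parallel\ntest: some_other\n"
    print(repeat_test_line(sample, "select_parallel", 3), end="")
```

Each copy sits on its own "test:" line, so the repeated runs execute one
after another rather than concurrently within a single group.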

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
