Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
Date: 2018-03-04 02:48:35
Message-ID: CAEepm=0pAqV8Nqdt2D6+nH=bkTkchKjL46tokcGUmGo5P7+4zg@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 4, 2018 at 3:40 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On March 3, 2018 6:36:51 PM PST, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>On 03/04/2018 03:20 AM, Thomas Munro wrote:
>>> Hi,
>>>
>>> I saw a one-off failure like this:
>>>
>>> QUERY PLAN
>>>
>>--------------------------------------------------------------------------
>>> Aggregate (actual rows=1 loops=1)
>>> ! -> Nested Loop (actual rows=98000 loops=1)
>>> -> Seq Scan on tenk2 (actual rows=10 loops=1)
>>> Filter: (thousand = 0)
>>> Rows Removed by Filter: 9990
>>> ! -> Gather (actual rows=9800 loops=10)
>>> Workers Planned: 4
>>> Workers Launched: 4
>>> -> Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
>>> --- 485,495 ----
>>> QUERY PLAN
>>>
>>--------------------------------------------------------------------------
>>> Aggregate (actual rows=1 loops=1)
>>> ! -> Nested Loop (actual rows=97984 loops=1)
>>> -> Seq Scan on tenk2 (actual rows=10 loops=1)
>>> Filter: (thousand = 0)
>>> Rows Removed by Filter: 9990
>>> ! -> Gather (actual rows=9798 loops=10)
>>> Workers Planned: 4
>>> Workers Launched: 4
>>> -> Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
>>>
>>>
>>> Two tuples apparently went missing.
>>>
>>> Similar failures on the build farm:
>>>
>>>
>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=okapi&dt=2018-03-03%2020%3A15%3A01
>>>
>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-03-03%2018%3A13%3A32
>>>
>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-03-03%2017%3A55%3A11
>>>
>>> Could this be related to commit
>>> 34db06ef9a1d7f36391c64293bf1e0ce44a33915 or commit
>>> 497171d3e2aaeea3b30d710b4e368645ad07ae43?
>>>
>>
>>I think the same failure (or at least very similar plan diff) was
>>already mentioned here:
>>
>>https://www.postgresql.org/message-id/17385.1520018934@sss.pgh.pa.us
>>
>>So I guess someone else already noticed, but I don't see the cause
>>identified in that thread.

Oh. Sorry, I didn't recognise that as the same thing from the title.
It doesn't seem to be related to the number of workers launched at
all... it looks more like the tuple queue is misbehaving. Though I
haven't got any proof of anything yet.
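
A side note on reading the numbers in the plans quoted above: EXPLAIN
ANALYZE reports "actual rows" as an average per loop, so the Gather line
dropping from 9800 to 9798 over 10 loops does not by itself mean exactly
two tuples were lost. The Nested Loop totals (98000 vs 97984) suggest 16
tuples went missing across the rescans. A minimal sketch of that
arithmetic follows; the row counts are copied from the plans, while the
helper name and the round-to-nearest display rule are assumptions made
for illustration.

```python
# Row counts copied from the two EXPLAIN ANALYZE outputs quoted above.
LOOPS = 10               # loops=10 on the Gather node
expected_total = 98000   # Nested Loop rows in the expected output
observed_total = 97984   # Nested Loop rows in the failing run


def per_loop_average(total_rows: int, loops: int) -> int:
    """Hypothetical helper: average rows per loop, rounded to the
    nearest integer (assumed to match how the plan displays it)."""
    return round(total_rows / loops)


# Total shortfall across all rescans of the Gather node.
missing_total = expected_total - observed_total

print("tuples missing in total:", missing_total)
print("per-loop average, expected:", per_loop_average(expected_total, LOOPS))
print("per-loop average, observed:", per_loop_average(observed_total, LOOPS))
```

The per-loop averages come out as 9800 and 9798, matching the Gather
lines in the diff, while the total shortfall is 16. The plans do not show
how those 16 are distributed over the 10 rescans.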

> Robert and I started discussing it a bit over IM. No conclusion. Robert tried to reproduce locally, including disabling atomics, without luck.
>
> Can anybody reproduce locally?

I've seen it several times on Travis CI. (So I would normally have
been able to tell you about this problem before it was committed,
except that the email thread was too long and the mail archive app
cuts long threads off!) Will try on some different kinds of computers
that I have local control of... I suspect (knowing how we run it on
Travis CI) that being way overloaded might be helpful...

--
Thomas Munro
http://www.enterprisedb.com
