Re: pgsql: Support partition pruning at execution time

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Support partition pruning at execution time
Date: 2018-04-12 13:17:33
Message-ID: CAKJS1f8o2Yd=rOP=Et3A0FWgF+gSAOkFSU6eNhnGzTPV7nN8sQ@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On 11 April 2018 at 18:58, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> On 10 April 2018 at 08:55, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
>>> David Rowley wrote:
>>>> Okay, I've written and attached a fix for this. I'm not 100% certain
>>>> that this is the cause of the problem on pademelon, but the code does
>>>> look wrong, so needs to be fixed. Hopefully, it'll make pademelon
>>>> happy, if not I'll think a bit harder about what might be causing that
>>>> instability.
>>
>>> Pushed it just now. Let's see what happens with pademelon now.
>>
>> I've had pademelon's host running a "make installcheck" loop all day
>> trying to reproduce the problem. I haven't gotten a bite yet (although
>> at 15+ minutes per cycle, this isn't a huge number of tests). I think
>> we were remarkably (un)lucky to see the problem so quickly after the
>> initial commit, and I'm afraid pademelon isn't going to help us prove
>> much about whether this was the same issue.
>>
>> This does remind me quite a bit though of the ongoing saga with the
>> postgres_fdw test instability. Given the frequency with which that's
>> failing in the buildfarm, you would not think it's impossible to
>> reproduce outside the buildfarm, and yet I'm here to tell you that
>> it's pretty damn hard. I haven't succeeded yet, and that's not for
>> lack of trying. Could there be something about the buildfarm
>> environment that makes these sorts of things more likely?
>
> coypu just demonstrated that this was not the cause of the problem [1]
>
> I'll study the code a bit more and see if I can think why this might
> be happening.
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=coypu&dt=2018-04-11%2004%3A17%3A38&stg=install-check-C

I've spent a bit of time tonight trying to dig into this problem to
see if I can figure out what's going on.

I ended up running the following script on both a Linux x86_64 machine
and also a power8 machine.

#!/bin/bash
for x in {1..1000}
do
    echo "$x"
    for i in {1..1000}
    do
        psql -d postgres -f test.sql -o test.out
        diff -u test.out test.expect
    done
done

I was unable to recreate this problem after about 700k loops on the
Linux machine and 130k loops on the power8.
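
For anyone repeating this, a variant of the loop that stops at the first
mismatch and keeps the failing output may be easier to leave unattended.
This is only a sketch, not what was actually run: the function name
run_until_diff and the failed.out file are hypothetical, and it captures
psql's standard output rather than using -o, which is assumed to be
equivalent for this test.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original test run):
# rerun a command until its stdout differs from an expected file.
# usage: run_until_diff EXPECTED_FILE MAX_LOOPS CMD [ARGS...]
run_until_diff() {
    rud_expect=$1
    rud_max=$2
    shift 2
    rud_out=$(mktemp) || return 2
    rud_i=1
    while [ "$rud_i" -le "$rud_max" ]; do
        # Capture stdout only; stderr is left on the terminal.
        "$@" > "$rud_out"
        if ! diff -u "$rud_out" "$rud_expect"; then
            echo "mismatch on iteration $rud_i"
            # Keep the failing output for later inspection.
            cp "$rud_out" failed.out
            rm -f "$rud_out"
            return 1
        fi
        rud_i=$((rud_i + 1))
    done
    rm -f "$rud_out"
    echo "no mismatch in $rud_max iterations"
    return 0
}
```

It could then be invoked as, for example,
run_until_diff test.expect 1000000 psql -d postgres -f test.sql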

I've emailed the owner of coypu to ask if it would be possible to get
access to the machine, or have him run the script to see if it does
actually fail. Currently waiting to hear back.

I've made another pass over the nodeAppend.c code and I'm unable to
see what might cause this, although I did discover a bug where
first_partial_plan is not set taking into account that some subplans
may have been pruned away during executor init. The only thing I think
this would cause is for parallel workers to not properly help out with
some partial plans if some earlier subplans were pruned. I can see no
reason for this to have caused this particular issue since the
first_partial_plan would be 0 with and without the attached fix.

Tom, would there be any chance you could run the above script for a
while on pademelon to see if it can in fact reproduce the problem?
coypu did show this problem in the install check, so I don't think it
will need the other concurrent tests to fail. If you can recreate,
after adjusting the expected output, does the problem still exist in
5c0675215?

I also checked which other tests perform an EXPLAIN ANALYZE of a plan
with a Parallel Append, and I see there are none, so I've not ruled out
that this is an existing bug. git grep "explain.*analyze" does not show
much outside of the partition_prune tests either.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment                    Content-Type              Size
test.sql                      text/plain                464 bytes
test.expect                   application/octet-stream  1.5 KB
first_partial_plan_fix.patch  application/octet-stream  3.4 KB
setup.sql                     text/plain                837 bytes
