Re: enable_incremental_sort changes query behavior

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: James Coleman <jtc331(at)gmail(dot)com>, Jaime Casanova <jaime(dot)casanova(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: enable_incremental_sort changes query behavior
Date: 2020-10-07 22:34:38
Message-ID: 20201007223438.npgyo4qqhzkztdnd@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 07, 2020 at 03:52:04PM -0400, Robert Haas wrote:
>On Thu, Oct 1, 2020 at 9:03 AM James Coleman <jtc331(at)gmail(dot)com> wrote:
>> The plan (obtained by replacing the volatile function with a stable one):
>>
>> Unique
>> -> Nested Loop
>> -> Gather Merge
>> Workers Planned: 2
>> -> Sort
>> Sort Key: ref_0.i, (md5(ref_0.t))
>> -> Parallel Index Scan using ref_0_i_idx on ref_0
>> -> Function Scan on generate_series ref_1
>>
>> Changing `md5(ref_0.t)` to `random()::text || ref_0.t` causes the error.
>
>As far as I remember this stuff, target lists for any nodes below the
>top of the join tree were previously always just Var nodes. Any
>expressions that needed to be computed were computed at the level of
>the topmost join, before adding upper planner steps like Agg,
>WindowAgg, Limit, etc. But, here you've got a complex expression --
>md5(ref_0.t) -- begin computed somewhere in the middle of the plan
>tree so that the sort can be performed at that point. However, that
>seems likely to break a bunch of assumptions all over the planner,
>because, unless a lot of work and surgery has been done since I looked
>at this last, that md5(ref_0.t) value is not going to "bubble up" to
>higher levels of the plan through the tlists of the individual nodes.
>
>That poses a risk that it will be recomputed multiple times, which is
>particularly bad for volatile functions because you might not always
>get the same answer, but it seems like it might be bad even for
>non-volatile functions, because it could be slow if the value is used
>at multiple places in the query. It feels like the sort of thing that
>might have broken some other assumptions, too, although I can't say
>exactly what.

I agree this might be a clear correctness issue for volatile functions.
Not sure about stable functions, though - it's easy to construct cases
where pushing the function down results in much faster execution.

Consider this:

create table t (id int, val text);
insert into t select 1, repeat('x', 1000) from generate_series(1,100) s(i);

-- no pushdown
select id, md5(t1.val) from t t1 join t t2 using (id) join t t3 using (id);
Time: 5222.895 ms (00:05.223)

-- manual pushdown
with x as (select id, md5(t.val) from t)
select id, t1.md5 from x t1 join x t2 using (id) join x t3 using (id);
Time: 486.455 ms

Of course, this is independent of what the planner assumes correctly,
which is what's being discussed in this thread. And I'm sure there are
plenty counter-examples too, so we'd need to cost this properly.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2020-10-07 22:36:36 Re: Recent failures on buildfarm member hornet
Previous Message Tomas Vondra 2020-10-07 22:22:43 Re: enable_incremental_sort changes query behavior