Re: "could not find pathkey item to sort" for TPC-DS queries 94-96

From: Luc Vlaming <luc(at)swarm64(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: "could not find pathkey item to sort" for TPC-DS queries 94-96
Date: 2021-04-15 09:33:56
Message-ID: d34620b9-3585-553a-3b76-ef7151c16f0e@swarm64.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 15-04-2021 04:01, James Coleman wrote:
> On Wed, Apr 14, 2021 at 5:42 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
>>
>> On Mon, Apr 12, 2021 at 8:37 AM Tomas Vondra
>> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>>
>>> On 4/12/21 2:24 PM, Luc Vlaming wrote:
>>>> Hi,
>>>>
>>>> When trying to run on master (but afaik also PG-13) TPC-DS queries 94,
>>>> 95 and 96 on a SF10 I get the error "could not find pathkey item to sort".
>>>> When I disable enable_gathermerge the problem goes away and then the
>>>> plan for query 94 looks like below. I tried figuring out what the
>>>> problem is but to be honest I would need some pointers as the code that
>>>> tries to matching equivalence members in prepare_sort_from_pathkeys is
>>>> something i'm really not familiar with.
>>>>
>>>
>>> Could be related to incremental sort, which allowed some gather merge
>>> paths that were impossible before. We had a couple issues related to
>>> that fixed in November, IIRC.
>>>
>>>> To reproduce you can either ingest and test using the toolkit I used too
>>>> (see https://github.com/swarm64/s64da-benchmark-toolkit/) or
>>>> alternatively just use the schema (see
>>>> https://github.com/swarm64/s64da-benchmark-toolkit/tree/master/benchmarks/tpcds/schemas/psql_native)
>>>>
>>>
>>> Thanks, I'll see if I can reproduce that with your schema.
>>>
>>>
>>> regards
>>>
>>> --
>>> Tomas Vondra
>>> EnterpriseDB: http://www.enterprisedb.com
>>> The Enterprise PostgreSQL Company
>>
>> The query in question is:
>>
>> select count(*)
>> from store_sales
>> ,household_demographics
>> ,time_dim, store
>> where ss_sold_time_sk = time_dim.t_time_sk
>> and ss_hdemo_sk = household_demographics.hd_demo_sk
>> and ss_store_sk = s_store_sk
>> and time_dim.t_hour = 15
>> and time_dim.t_minute >= 30
>> and household_demographics.hd_dep_count = 7
>> and store.s_store_name = 'ese'
>> order by count(*)
>> limit 100;
>>
>> From debugging output it looks like this is the plan being chosen
>> (cheapest total path):
>> Gather(store_sales household_demographics time_dim) rows=60626
>> cost=3145.73..699910.15
>> HashJoin(store_sales household_demographics time_dim)
>> rows=25261 cost=2145.73..692847.55
>> clauses: store_sales.ss_hdemo_sk =
>> household_demographics.hd_demo_sk
>> HashJoin(store_sales time_dim) rows=252609
>> cost=1989.73..692028.08
>> clauses: store_sales.ss_sold_time_sk =
>> time_dim.t_time_sk
>> SeqScan(store_sales) rows=11998564
>> cost=0.00..658540.64
>> SeqScan(time_dim) rows=1070
>> cost=0.00..1976.35
>> SeqScan(household_demographics) rows=720
>> cost=0.00..147.00
>>
>> prepare_sort_from_pathkeys fails to find a pathkey because
>> tlist_member_ignore_relabel returns null -- which seemed weird because
>> the sortexpr is an Aggref (in a single member equivalence class) and
>> the tlist contains a single member that's also an Aggref. It turns out
>> that the only difference between the two Aggrefs is that the tlist
>> entry has "aggsplit = AGGSPLIT_INITIAL_SERIAL" while the sortexpr has
>> aggsplit = AGGSPLIT_SIMPLE.
>>
>> That's as far as I've gotten so far, but I figured I'd get that info
>> out to see if it means anything obvious to anyone else.
>
> This really goes back to [1] where we fixed a similar issue by making
> find_em_expr_usable_for_sorting_rel parallel the rules in
> prepare_sort_from_pathkeys.
>
> Most of those conditions got copied, and the case we were trying to
> handle is the fact that prepare_sort_from_pathkeys can generate a
> target list entry under those conditions if one doesn't exist. However
> there's a further restriction there I don't remember looking at: it
> uses pull_var_clause and tlist_member_ignore_relabel to ensure that
> all of the vars that feed into the sort expression are found in the
> target list. As I understand it, that is: it will build a target list
> entry for something like "md5(column)" if "column" (and that was one
> of our test cases for the previous fix) is in the target list already.
>
> But there's an additional detail here: the call to pull_var_clause
> requests aggregates, window functions, and placeholders be treated as
> vars. That means for our Aggref case it would require that the two
> Aggrefs be fully equal, so the differing aggsplit member would cause a
> target list entry not to be built, hence our error here.
>
> I've attached a quick and dirty patch that encodes that final rule
> from prepare_sort_from_pathkeys into
> find_em_expr_usable_for_sorting_rel. I can't help but think that
> there's a cleaner way to do with this with less code duplication, but
> hindering that is that prepare_sort_from_pathkeys is working with a
> TargetList while find_em_expr_usable_for_sorting_rel is working with a
> list of expressions.
>
> James
>
> 1: https://www.postgresql.org/message-id/CAAaqYe9C3f6A_tZCRfr9Dm7hPpgGwpp4i-K_%3DNS9GWXuNiFANg%40mail.gmail.com
>

Hi,

The patch seems to make the planner proceed and not error out anymore.
Cannot judge if it's doing the right thing however or if its enough :)
It works for me for all reported queries however (queries 94-96).

And sorry for the confusion wrt the stacktrace and plan. I tried to
produce a plan to possibly help with debugging but that would ofc then
not have the problem of the missing sortkey as otherwise i cannot
present a plan :) The stacktrace was however correct, and the plan
considered involved a gather-merge with a sort. Unfortunately I could
not (easily) get the plan outputted in the end; even when setting the
costs to 0 somehow...

Regards,
Luc

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2021-04-15 10:22:52 Re: Truncate in synchronous logical replication failed
Previous Message Amit Kapila 2021-04-15 09:16:35 Re: Replication slot stats misgivings