Re: Fix incorrect start up costs for WindowAgg paths (bug #17862)

From: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tim Palmer <tim3sp(at)gmail(dot)com>
Subject: Re: Fix incorrect start up costs for WindowAgg paths (bug #17862)
Date: 2023-08-03 14:02:10
Message-ID: CAKU4AWoDKp3jtZGAbagb_e6HontVCQ1BO7xky9xw7egkxnUWAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Aug 3, 2023 at 7:29 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> Thanks for having a look at this.
>
> On Thu, 3 Aug 2023 at 18:49, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:
> > 1. ORDER BY or PARTITION BY
> >
> > select *, count(two) over (order by unique1) from tenk1 limit 1;
> > DEBUG: startup_tuples = 1
> > DEBUG: startup_tuples = 1
> >
> > select *, count(two) over (partition by unique1) from tenk1 limit 1;
> > DEBUG: startup_tuples = 1
> > DEBUG: startup_tuples = 1
> >
> > Due to the Executor of nodeWindowAgg, we have to fetch the next tuple
> > until it mismatches with the current one, then we can calculate the
> > WindowAgg function. In the current patch, we didn't count the
> > mismatched tuple. I verified my thought with 'break at IndexNext'
> > function and see IndexNext is called twice, so in the above case the
> > startup_tuples should be 2?
>
> You're probably right here. I'd considered that it wasn't that
> critical and aimed to attempt to keep the estimate close to the number
> of rows that'll be aggregated. I think that's probably not the best
> thing to do as if you consider the EXCLUDE options, those just exclude
> tuples from aggregation, it does not mean we read fewer tuples from
> the subnode. I've updated the patch accordingly.
>

Thanks.

>
> > 2. ORDER BY and PARTITION BY
> >
> > select two, hundred,
> > count(two) over (partition by ten order by hundred)
> > from tenk1 limit 1;
> >
> > DEBUG: startup_tuples = 10
> > two | hundred | count
> > -----+---------+-------
> > 0 | 0 | 100
> >
> > If we consider the mismatched tuples, it should be 101?
>
> I don't really see how we could do better with the current level of
> statistics. The stats don't know that there are only 10 distinct
> "hundred" values for rows which have ten=1. All we have is n_distinct
> on tenk1.hundred, which is 100.

Yes, actually I didn't figure it out before / after my posting.

>

> > 3. As we can see the log for startup_tuples is logged twice sometimes,
> > the reason is because it is used in cost_windowagg, so it is calculated
> > for every create_one_window_path. I think the startup_tuples should be
> > independent with the physical path, maybe we can cache it somewhere to
> > save some planning cycles?
>
> I wondered about that too but I ended up writing off the idea of
> caching because the input_tuple count comes from the Path and the
> extra calls are coming from other Paths, which could well have some
> completely different value for input_tuples.
>
>
Looks reasonable.

I have checked the updated patch and LGTM.

--
Best Regards
Andy Fan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andy Fan 2023-08-03 14:27:34 Re: Extract numeric filed in JSONB more effectively
Previous Message Pavel Stehule 2023-08-03 13:52:40 Re: Extract numeric filed in JSONB more effectively