Re: optimizing a join against a windowed function

From: Torsten Förtsch <tfoertsch123(at)gmail(dot)com>
To: James Brown <james(at)instrumentl(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org, Gabriel Sinkin <gabe(at)instrumentl(dot)com>
Subject: Re: optimizing a join against a windowed function
Date: 2024-08-30 14:42:43
Message-ID: CAKkG4_ncUgaNGRCNpwQiL1tWkkWpByiLYZD2hU8r2WGzDWM=dQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Try perhaps something along these lines:

```
SELECT t.id, i.insight_id
FROM taxpayers AS t
JOIN LATERAL (
SELECT x.id AS insight_id
FROM insights AS x
WHERE x.taxpayer_id = t.id
AND x.year IS NOT NULL
ORDER BY year DESC
LIMIT 1
) AS i ON true
WHERE t.id IN (?, ?)
```

If you don't have millions of ? in that IN clause, then that might be
faster.

On Fri, Aug 30, 2024 at 1:36 PM James Brown <james(at)instrumentl(dot)com> wrote:

> Hello:
>
> I'm attempting to figure out whether an optimizer behavior I'm seeing is a
> PostgreSQL bug or expected behavior. The scenario:
>
> I have two tables: one named taxpayers which has a goodish number of
> columns an an integer PK id, and one named insights, which has
> a taxpayer_id foreign key to taxpayers, a year, and (again) a lot of other
> columns. There's an index on insights (taxpayer_id, year DESC). I'm
> executing the following SQL:
>
> ```
> SELECT taxpayers.id, insight_id
> FROM taxpayers
> JOIN (
> WITH ordered_insights AS (
> SELECT taxpayer_id, id, RANK() OVER (PARTITION BY taxpayer_id
> ORDER BY year DESC) AS rank
> FROM insights
> WHERE year IS NOT NULL
> )
> SELECT taxpayer_id, id AS insight_id
> FROM ordered_insights
> WHERE rank = 1
> ) latest_insights ON latest_insights.taxpayer_id = taxpayers.id
> WHERE taxpayers.id IN (?, ?)
> ```
>
> (this is simplified example; the real code has the subselect in a view so
> that it can execute this kind of join from an ORM; it also joins quite a
> few tables downstream after this)
>
> If there's only a single value in the IN clause, the EXPLAIN plan looks
> great:
>
> Nested Loop (cost=0.86..53.30 rows=1 width=16)
> -> Index Only Scan using taxpayers_pkey on taxpayers (cost=0.43..8.45
> rows=1 width=8)
> Index Cond: (id = 650974)
> -> Subquery Scan on ordered_insights (cost=0.43..44.83 rows=1 width=16)
> Filter: (ordered_insights.rank = 1)
> -> WindowAgg (cost=0.43..44.71 rows=10 width=28)
> Run Condition: (rank() OVER (?) <= 1)
> -> Index Scan using index_insights_on_taxpayer_id_year_desc
> on insights (cost=0.43..44.53 rows=10 width=20)
> Index Cond: ((taxpayer_id = 650974) AND (year IS NOT
> NULL))
> (9 rows)
>
> However, if there are multiple rows in the IN clause, the optimizer
> decides to execute the subselect against the entire giant table, and it is
> not great:
>
> Hash Join (cost=2611586.97..2800201.15 rows=1 width=16)
> Hash Cond: (ordered_insights.taxpayer_id = taxpayers.id)
> -> Subquery Scan on ordered_insights (cost=2611570.10..2799818.65
> rows=28961 width=16)
> Filter: (ordered_insights.rank = 1)
> -> WindowAgg (cost=2611570.10..2727415.36 rows=5792263 width=28)
> Run Condition: (rank() OVER (?) <= 1)
> -> Sort (cost=2611570.10..2626050.76 rows=5792263 width=20)
> Sort Key: insights.taxpayer_id, insights.year DESC
> -> Seq Scan on insights (cost=0.00..1723354.01
> rows=5792263 width=20)
> Filter: (year IS NOT NULL)
> -> Hash (cost=16.85..16.85 rows=2 width=8)
> -> Index Only Scan using taxpayers_pkey on taxpayers
> (cost=0.43..16.85 rows=2 width=8)
> Index Cond: (id = ANY ('{650974,243848}'::bigint[]))
>
> If I add in a second repetitive WHERE clause, it goes back to being happy,
> but that feels a bit like a hack:
>
> # EXPLAIN SELECT taxpayers.id, insight_id
> FROM taxpayers
> JOIN (
> WITH ordered_insights AS (
> SELECT taxpayer_id, id, RANK() OVER (PARTITION BY taxpayer_id
> ORDER BY year DESC) AS rank
> FROM insights
> WHERE year IS NOT NULL
> )
> SELECT taxpayer_id, id AS insight_id
> FROM ordered_insights
> WHERE rank = 1
> ) latest_insights ON latest_insights.taxpayer_id = taxpayers.id
> WHERE taxpayers.id IN (650974, 243848) AND latest_insights.taxpayer_id IN
> (650974, 243848);
> QUERY PLAN
>
>
> -----------------------------------------------------------------------------------------------------------------------------
> Nested Loop (cost=0.86..110.57 rows=1 width=16)
> Join Filter: (taxpayers.id = ordered_insights.taxpayer_id)
> -> Subquery Scan on ordered_insights (cost=0.43..93.69 rows=1 width=16)
> Filter: (ordered_insights.rank = 1)
> -> WindowAgg (cost=0.43..93.42 rows=21 width=28)
> Run Condition: (rank() OVER (?) <= 1)
> -> Index Scan using index_insights_on_taxpayer_id_year_desc
> on insights (cost=0.43..93.06 rows=21 width=20)
> Index Cond: ((taxpayer_id = ANY
> ('{650974,243848}'::bigint[])) AND (year IS NOT NULL))
> -> Index Only Scan using taxpayers_pkey on taxpayers (cost=0.43..16.85
> rows=2 width=8)
> Index Cond: (id = ANY ('{650974,243848}'::bigint[]))
>
> This feels like a bug to me, but maybe I'm missing something obvious. I
> don't really get why the optimizer wouldn't be able to infer the second
> condition given that I'm doing a join on a non-nullable integer column (so
> there's no NaN nonsense to worry about), but maybe I'm missing something
> obvious.
>
> I've reproduced this on PostgreSQL 15.7 and 17beta3.
>
> Thanks for any insights y'all can provide!
>
> --
> James Brown
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Peter J. Holzer 2024-08-31 16:54:45 Upgrade Ubuntu 22 -> 24 may break PostgreSQL
Previous Message David G. Johnston 2024-08-30 13:48:30 Re: default privileges are npt working