Re: select subquery versus join subquery

From: "Gunnar \"Nick\" Bluth" <gunnar(dot)bluth(dot)extern(at)elster(dot)de>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: select subquery versus join subquery
Date: 2017-05-23 11:03:37
Message-ID: a7cc091e-b73e-389e-739d-c7d5b480d58d@elster.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Am 05/22/2017 um 09:57 PM schrieb Jeff Janes:
> I need to make a view which decorates rows from a parent table with
> aggregated values from a child table. I can think of two ways to write
> it, one which aggregates the child table to make a new query table and
> joins the parent to that, as shown in "view1" below. Or does subselect
> in the select list to aggregate just the currently matching rows, and
> returns that value, as in "view2" below.
>
> While these two are semantically equivalent, the planner doesn't
> understand that, and always executes them pretty much the way you would
> naively do it based on the text of the query.
>
> But view1 is pretty slow if the WHERE clause is highly selective (like
> "WHERE cutoff<0.00001") because it has to summarize the entire child
> table just to pull out a few rows. But view2 is pretty slow if the
> entire view or most of it (like "WHERE cutoff<0.9") is being returned.
>
> Is there some 3rd way to write the query which allows the planner to
> switch between strategies (summarize whole table vs summarize values on
> demand) depending on the known selectivity of the where clause?
>
> In this case, the planner is getting the relative cost estimates roughly
> correct. It is not a problem of mis-estimation.
>
> I can always create two views, view_small and view_large, and swap
> between them based on my own knowledge of how restrictive a query is
> likely to be, but that is rather annoying. Especially in the real-world
> situation, which is quite a bit more complex than this.
>
> create table thing as select x as id, random() as cutoff from
> generate_series(1,2000000) f(x);
>
> create table thing_alias as select
> floor(power(random()*power(2000000,5),0.2))::int thing_id, md5(x::text),
> random() as priority from generate_series(1,150000) f(x);
>
> create index on thing_alias (thing_id );
>
> create index on thing (cutoff );
>
> vacuum; analyze;
>
> create view view1 as select id, md5,cutoff from thing left join
> (
> select distinct on (thing_id) thing_id, md5 from thing_alias
> order by thing_id, priority desc
> ) as foo
> on (thing_id=id);
>
> create view view2 as select id,
> (
> select md5 from thing_alias where thing_id=id
> order by priority desc limit 1
> ) as md5,
> cutoff from thing;
>
> Cheers,
>
> Jeff

Hi Jeff,

how does something like

CREATE OR REPLACE VIEW public.view3 AS
SELECT thing.id,
foo.md5,
thing.cutoff
FROM thing,
LATERAL ( SELECT DISTINCT ON (thing_alias.thing_id)
thing_alias.thing_id,
thing_alias.md5
FROM thing_alias
WHERE thing_alias.thing_id = thing.id
ORDER BY thing_alias.thing_id, thing_alias.priority DESC) foo

work for you? At least that's always using an index scan here, as
opposed to view1, which (for me) defaults to a SeqScan on thing_alias at
a low cutoff.

*****
Note btw. that both view1 and view2 don't return any md5 values for me,
while view3 does!
*****

Results (ms, median of 3 runs):
cutoff< 0.1 0.9
view1: 348 1022
view2: 844 6484
view3: 842 5976

With

LATERAL ( SELECT string_agg(thing_alias.md5, ','::text) AS md5
FROM thing_alias
WHERE thing_alias.thing_id = thing.id
GROUP BY thing_alias.thing_id) foo

(which seems to make more sense ;-)

I yield 483 (0.1) and 3410 (0.9) ms (and return md5-Aggregates).

Cheers,
--
Gunnar "Nick" Bluth
DBA ELSTER

Tel: +49 911/991-4665
Mobil: +49 172/8853339

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Dinesh Chandra 12108 2017-05-23 12:42:52 Log update query along with username who has executed the same.
Previous Message Pavel Stehule 2017-05-23 04:41:36 Re: Can postgresql plan a query using multiple CPU cores?