select subquery versus join subquery

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: select subquery versus join subquery
Date: 2017-05-22 19:57:39
Message-ID: CAMkU=1zxu+xZ7X44FpZ=Syot_STrUiGR9miB8K-HCqvb=KMtwA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I need to make a view which decorates rows from a parent table with
aggregated values from a child table. I can think of two ways to write it,
one which aggregates the child table to make a new query table and joins
the parent to that, as shown in "view1" below. Or does subselect in the
select list to aggregate just the currently matching rows, and returns that
value, as in "view2" below.

While these two are semantically equivalent, the planner doesn't understand
that, and always executes them pretty much the way you would naively do it
based on the text of the query.

But view1 is pretty slow if the WHERE clause is highly selective (like
"WHERE cutoff<0.00001") because it has to summarize the entire child table
just to pull out a few rows. But view2 is pretty slow if the entire view
or most of it (like "WHERE cutoff<0.9") is being returned.

Is there some 3rd way to write the query which allows the planner to switch
between strategies (summarize whole table vs summarize values on demand)
depending on the known selectivity of the where clause?

In this case, the planner is getting the relative cost estimates roughly
correct. It is not a problem of mis-estimation.

I can always create two views, view_small and view_large, and swap between
them based on my own knowledge of how restrictive a query is likely to be,
but that is rather annoying. Especially in the real-world situation, which
is quite a bit more complex than this.

create table thing as select x as id, random() as cutoff from
generate_series(1,2000000) f(x);

create table thing_alias as select
floor(power(random()*power(2000000,5),0.2))::int thing_id, md5(x::text),
random() as priority from generate_series(1,150000) f(x);

create index on thing_alias (thing_id );

create index on thing (cutoff );

vacuum; analyze;

create view view1 as select id, md5,cutoff from thing left join
(
select distinct on (thing_id) thing_id, md5 from thing_alias
order by thing_id, priority desc
) as foo
on (thing_id=id);

create view view2 as select id,
(
select md5 from thing_alias where thing_id=id
order by priority desc limit 1
) as md5,
cutoff from thing;

Cheers,

Jeff

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Clemens Eisserer 2017-05-22 20:21:45 Can postgresql plan a query using multiple CPU cores?
Previous Message Merlin Moncure 2017-05-22 15:37:52 Re: Bulk persistence strategy