From: | "Gunnar \"Nick\" Bluth" <gunnar(dot)bluth(dot)extern(at)elster(dot)de> |
---|---|
To: | pgsql-performance(at)postgresql(dot)org |
Subject: | Re: select subquery versus join subquery |
Date: | 2017-05-23 11:03:37 |
Message-ID: | a7cc091e-b73e-389e-739d-c7d5b480d58d@elster.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
Am 05/22/2017 um 09:57 PM schrieb Jeff Janes:
> I need to make a view which decorates rows from a parent table with
> aggregated values from a child table. I can think of two ways to write
> it, one which aggregates the child table to make a new query table and
> joins the parent to that, as shown in "view1" below. Or does subselect
> in the select list to aggregate just the currently matching rows, and
> returns that value, as in "view2" below.
>
> While these two are semantically equivalent, the planner doesn't
> understand that, and always executes them pretty much the way you would
> naively do it based on the text of the query.
>
> But view1 is pretty slow if the WHERE clause is highly selective (like
> "WHERE cutoff<0.00001") because it has to summarize the entire child
> table just to pull out a few rows. But view2 is pretty slow if the
> entire view or most of it (like "WHERE cutoff<0.9") is being returned.
>
> Is there some 3rd way to write the query which allows the planner to
> switch between strategies (summarize whole table vs summarize values on
> demand) depending on the known selectivity of the where clause?
>
> In this case, the planner is getting the relative cost estimates roughly
> correct. It is not a problem of mis-estimation.
>
> I can always create two views, view_small and view_large, and swap
> between them based on my own knowledge of how restrictive a query is
> likely to be, but that is rather annoying. Especially in the real-world
> situation, which is quite a bit more complex than this.
>
> create table thing as select x as id, random() as cutoff from
> generate_series(1,2000000) f(x);
>
> create table thing_alias as select
> floor(power(random()*power(2000000,5),0.2))::int thing_id, md5(x::text),
> random() as priority from generate_series(1,150000) f(x);
>
> create index on thing_alias (thing_id );
>
> create index on thing (cutoff );
>
> vacuum; analyze;
>
> create view view1 as select id, md5,cutoff from thing left join
> (
> select distinct on (thing_id) thing_id, md5 from thing_alias
> order by thing_id, priority desc
> ) as foo
> on (thing_id=id);
>
> create view view2 as select id,
> (
> select md5 from thing_alias where thing_id=id
> order by priority desc limit 1
> ) as md5,
> cutoff from thing;
>
> Cheers,
>
> Jeff
Hi Jeff,
how does something like
CREATE OR REPLACE VIEW public.view3 AS
SELECT thing.id,
foo.md5,
thing.cutoff
FROM thing,
LATERAL ( SELECT DISTINCT ON (thing_alias.thing_id)
thing_alias.thing_id,
thing_alias.md5
FROM thing_alias
WHERE thing_alias.thing_id = thing.id
ORDER BY thing_alias.thing_id, thing_alias.priority DESC) foo
work for you? At least that's always using an index scan here, as
opposed to view1, which (for me) defaults to a SeqScan on thing_alias at
a low cutoff.
*****
Note btw. that both view1 and view2 don't return any md5 values for me,
while view3 does!
*****
Results (ms, median of 3 runs):
cutoff< 0.1 0.9
view1: 348 1022
view2: 844 6484
view3: 842 5976
With
LATERAL ( SELECT string_agg(thing_alias.md5, ','::text) AS md5
FROM thing_alias
WHERE thing_alias.thing_id = thing.id
GROUP BY thing_alias.thing_id) foo
(which seems to make more sense ;-)
I yield 483 (0.1) and 3410 (0.9) ms (and return md5-Aggregates).
Cheers,
--
Gunnar "Nick" Bluth
DBA ELSTER
Tel: +49 911/991-4665
Mobil: +49 172/8853339
From | Date | Subject | |
---|---|---|---|
Next Message | Dinesh Chandra 12108 | 2017-05-23 12:42:52 | Log update query along with username who has executed the same. |
Previous Message | Pavel Stehule | 2017-05-23 04:41:36 | Re: Can postgresql plan a query using multiple CPU cores? |