Re: Use cases for lateral that do not involve a set returning function

From: AJ Welch <awelch0100(at)gmail(dot)com>
To: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Use cases for lateral that do not involve a set returning function
Date: 2014-12-10 01:21:10
Message-ID: CAO-RzR+Yw1uOMVLn6zKGk6+pnsJVmCDN9V2okK-rNxeD8T4jMA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Thanks for the response. Yea, using lateral there definitely reads better
to me than using a correlated subquery. And it makes sense that performance
is ok since you're filtering on a specific person's id (as you hinted at
with `WHERE p.id = ...`) and the nested loop forced by `order by...limit 1`
presumably only loops once.

However, I would consider that more of an OLTP style query where I'm kind
of more interested in OLAP style queries as the referenced post was
building a funnel analysis over an events table. I don't think their
approach will scale. I guess it's kind of a specific question but that post
got me wondering if there are any use cases for lateral outside of SRFs
that do not generate a nested loop AND can not be achieved without lateral?
Basically, something I could use in an OLAP query that I couldn't use prior
to 9.3.

Also, just for the heck of it, I took a look at the explain plans for both
of your queries without `WHERE p.id = ...` to see how they would scale.

Correlated subquery:

QUERY PLAN
Hash Right Join (cost=163943.00..5932603426739.67 rows=7499298 width=24)
Hash Cond: (n.people_id = p.people_id)
Join Filter: (NOT (SubPlan 1))
-> Seq Scan on names n (cost=0.00..320543.41 rows=14998595 width=18)
Filter: (now() > validfrom)
-> Hash (cost=77028.00..77028.00 rows=5000000 width=10)
-> Seq Scan on people p (cost=0.00..77028.00 rows=5000000
width=10)
SubPlan 1
-> Seq Scan on names n2 (cost=0.00..395543.88 rows=1 width=0)
Filter: ((validfrom > n.validfrom) AND (people_id = p.people_id)
AND (now() > validfrom))

Lateral:

Nested Loop Left Join (cost=358043.67..1790218514528.00 rows=5000000
width=24)
-> Seq Scan on people p (cost=0.00..77028.00 rows=5000000 width=10)
-> Limit (cost=358043.67..358043.67 rows=1 width=18)
-> Sort (cost=358043.67..358043.68 rows=4 width=18)
Sort Key: n.validfrom
-> Seq Scan on names n (cost=0.00..358043.65 rows=4
width=18)
Filter: ((people_id = p.people_id) AND (now() >
validfrom))

Granted I haven't set up any indexes, but it looks like the correlated
subquery, after an initial access of the names table before the join,
accesses the names table again for each (person, name) pair after the join
(in the join filter). So it's worse than just 2 scans per person. Indeed,
the lateral subquery seems better because it accesses the person table and
then the names table once for each person. However, I might instead do
something like this to access each table just once:

explain

select *
from people p
left join (
select *, rank() over(partition by people_id
order by validfrom desc) as rank
from names
) n on p.people_id = n.people_id
and n.rank = 1

QUERY PLAN
Hash Right Join (cost=3427870.26..3942220.56 rows=5000000 width=36)
Hash Cond: (n.people_id = p.people_id)
-> Subquery Scan on n (cost=3263927.26..3751430.31 rows=75000 width=26)
Filter: (n.rank = 1)
-> WindowAgg (cost=3263927.26..3563929.14 rows=15000094 width=18)
-> Sort (cost=3263927.26..3301427.49 rows=15000094 width=18)
Sort Key: names.people_id, names.validfrom
-> Seq Scan on names (cost=0.00..245542.94
rows=15000094 width=18)
-> Hash (cost=77028.00..77028.00 rows=5000000 width=10)
-> Seq Scan on people p (cost=0.00..77028.00 rows=5000000
width=10)

Thanks,
AJ
http://chartio.com/
https://www.linkedin.com/in/ajw0100

On Tue, Dec 9, 2014 at 2:24 AM, Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
wrote:

> AJ Welch wrote:
> >
> http://blog.heapanalytics.com/postgresqls-powerful-new-join-type-lateral/
> >
> > I suspected some of the claims in the post may not have been accurate.
> This one in particular:
> >
> > "Without lateral joins, we would need to resort to PL/pgSQL to do this
> analysis. Or, if our data set
> > were small, we could get away with complex, inefficient queries."
> >
> >
> > The sum(1) and order by time limit 1 approach seemed less than ideal to
> me and I thought this analysis
> > could be done with normal left joins instead of lateral left joins. So I
> came up with a proof of
> > concept:
> >
> > https://github.com/ajw0100/snippets/tree/master/SQL/lateral
> >
> >
> > Is my conclusion in the README correct? Does anything beyond
> select...from...where force a nested
> > loop? In that case, is lateral really only useful with set returning
> functions as the docs suggest?
> > Does anyone know of any use cases for lateral that do not involve a set
> returning function?
>
> Only recently I used lateral joins to optimize a query.
>
> This is a sample of how the query looked bfore:
>
> SELECT ...
> FROM people p
> LEFT JOIN names n
> ON (n.people_id = p.people_id
> AND current_timestamp > n.validfrom
> AND NOT EXISTS (SELECT 1 FROM names n2
> WHERE n2.people_id = p.people_id
> AND current_timestamp > n2.validfrom
> AND n2.validfrom > n.validfrom)
> )
> WHERE p.id = ...
>
> So basically it is supposed to find the latest valid name for a person.
>
> This required two scans of the "names" table per "person" record.
>
> I rewrote it as
>
> SELECT ...
> FROM people p
> LEFT JOIN LATERAL (SELECT * FROM names n
> WHERE n.people_id = p.people_id
> AND current_timestamp > n.validfrom
> ORDER BY n.validfrom DESC LIMIT 1) n
> ON TRUE
> WHERE p.id = ...
>
> With the correct index this touched fewer blocks and worked faster.
> Also, though this is of course a matter of taste, it is more readable.
>
> Of course this forces a nested loop, but that is not bad as such.
> In my case it was not problem (I tried to hint at that with the WHERE
> clause).
>
> So yes, I think that LATERAL is useful even without set returning
> functions.
>
> Yours,
> Laurenz Albe
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Peter Geoghegan 2014-12-10 01:46:06 Re: Weird CPU utilization patterns with Postgres
Previous Message Adrian Klaver 2014-12-10 00:00:29 Re: pg_restore -n sch1 : schema "sch1" does not exist