Re: Costing foreign joins in postgres_fdw

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Costing foreign joins in postgres_fdw
Date: 2015-12-21 13:27:47
Message-ID: CAFjFpRcLE225h4aHL1OE3STZ9buAnC2S9dJUG6WCeKp6Cnj8jg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Dec 18, 2015 at 6:39 PM, Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
wrote:

> Ashutosh Bapat wrote:
> > Costs for foreign queries are either obtained from the foreign server
> using EXPLAIN (if
> > use_remote_estimate is ON) otherwise they are cooked up locally based on
> the statistics available. For
> > joins as well, we have to do the same. If use_remote_estimates [1] is
> ON, we can get the costs from
> > the foreign server. Rest of the mail discusses approaches for estimating
> the costs when
> > use_remote_estimates is OFF.
> >
> >
> > 1. Unlike base relations where the table data "has to be" fetched from
> the foreign server, a join
> > doesn't "have to be" fetched from the foreign server. So, even if
> use_remote_estimate is OFF for a
> > base relation, we do try to estimate something locally. But for a join
> that's not compulsory, so we
> > can choose not to estimate anything and not push down the join if
> use_remote_estimate is OFF. Whether
> > we do that depends upon how well we can estimate the join cost when
> use_remote_estimate is OFF.
> >
> > 2. Locally estimating the cost of join that will be performed on the
> foreign server is difficult
> > because we do not know which join strategy the foreign server is going
> to use, which in turn depends
> > upon the availability of indexes, work memory, statistics about joining
> expressions etc. One way to do
> > this is to use the cost of cheapest local join path built upon foreign
> outer and inner paths, to
> > estimate the cost of executing the join at the foreign server The
> startup and run time costs for
> > sending, parsing and planning query at the foreign server as well as the
> cost to fetch the tuples need
> > to be adjusted, so that it doesn't get counted twice. We may assume that
> the cost for the foreign join
> > will be some factor of the adjusted cost, like we have done for
> estimating cost of sort pushdown. The
> > reason we choose cheapest path with foreign inner and outer paths is
> because that's likely to be a
> > closer to the real estimate than the path which does not have foreign
> inner and outer paths. In the
> > absence of such path, we should probably not push the join down since no
> local path has found pushing
> > inner and outer to be cheaper and it's likely (certainly not a rule)
> that pushing the join in question
> > down is not going to be cheaper than the local paths.
> >
> >
> > 1st option is easy but it sounds too restrictive. 2nd option liberal but
> is complex.
>
> My gut feeling is that for a join where all join predicates can be pushed
> down, it
> will usually be a win to push the join to the foreign server.
>

At least in the first cut, we are planning to push a join down only when
all the "join" predicates are pushable, otherwise, we do not push down the
join.

> So in your first scenario, I'd opt for always pushing down the join
> if possible if use_remote_estimate is OFF.
>

Well, that's what the mail is about. How do we decide whether or not push a
join based on the costs when remote estimation is not possible.

>
> Your second scenario is essentially to estimate that a pushed down join
> will
> always be executed as a nested loop join, which will in most cases produce
> an unfairly negative estimate.
>

No, that's not what is intention behind the second scenario. The cheapest
local strategy needn't always be (in fact it's mostly not) nested loop
join. It could be anything hash join, merge join. The reason I suggested
cheapest local was to give foreign join an advantage over all the local
paths created. If we cost the foreign join on fairly lower side compared to
all the local paths and then add transfer costs, that might give foreign
join an added advantage. But relying on local cheapest strategy may cause
the costs to sway a lot because of inaccuracies in local statistics. That's
where I think your next idea helps.

>
> What about using local statistics to come up with an estimated row count
> for
> the join and use that as the basis for an estimate? My idea here is that
> it
> is always be a win to push down a join unless the result set is so large
> that
> transferring it becomes the bottleneck.
> Maybe, to come up with something remotely realistic, a formula like
>
> sum of locally estimated costs of sequential scan for the base table
> plus count of estimated result rows (times a factor)
>

Cost of foreign join = cost of scanning all the involved base relations +
cost of applying quals + cost of transferring result of the join. The join
is planned two relations (base or join) at a time, so we should be able to
compute this cost recursively, rather than grabbing cost of scanning base
relations every time. That will also work if part of the join tree is built
with use_remote_estimate ON and part without.

>
> Yours,
> Laurenz Albe
>
>
>

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Teodor Sigaev 2015-12-21 14:18:37 Re: Patch: Implement failover on libpq connect level.
Previous Message Simon Riggs 2015-12-21 13:06:11 Re: ToDo list update for BRIN indexes