Re: Question about optimising (Postgres_)FDW

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Hannu Krosing <hannu(at)krosing(dot)net>
Cc: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Question about optimising (Postgres_)FDW
Date: 2014-04-17 05:50:41
Message-ID: CAFjFpRe+HwbSauaDsgHXtD6_4duBo-ULkbHiO78-sRE56_T9Tg@mail.gmail.com
Lists: pgsql-hackers

AFAIK, PostgreSQL's join nodes (except for hash join) consider one row at a
time from the outer table and match inner-table rows one at a time. What
needs to be done in the case you are suggesting is to gather all the rows
of the outer table, fetch their respective join columns, and pass that
information down to the inner side. The inner side would then return a
bunch of rows qualifying the join condition, and that set would be joined
back to the outer rows.

For an equality operator this might be possible with hash join, but for
other operators hash join won't work. For those we would need to
materialize the outer result, which has its own cost that needs to be
factored in. That is a lot of changes, but they may be worth it for
foreign scans with high connection costs.

On Wed, Apr 16, 2014 at 9:40 PM, Hannu Krosing <hannu(at)krosing(dot)net> wrote:

> On 04/16/2014 03:16 PM, Hannu Krosing wrote:
> > On 04/16/2014 01:35 PM, Etsuro Fujita wrote:
> >> (2014/04/16 6:55), Hannu Krosing wrote:
> > ...
> >> Maybe I'm missing something, but I think that you can do what I think
> >> you'd like to do by the following procedure:
> > No, what I'd like PostgreSQL to do is to
> >
> > 1. select the id+set from local table
> > 2. select the rows from remote table with WHERE ID IN (<set selected in
> > step 1>)
> > 3. then join the original set to the selected set, with any suitable
> > join strategy
> >
> > The things I do not want are
> >
> > A. selecting all rows from remote table
> > (this is what your examples below do)
> >
> > or
> >
> > B. selecting rows from remote table by single selects using "ID = $"
> > (this is something that I managed to do by some tweaking of costs)
> >
> > as A will always be slow if there are millions of rows in the remote
> > table, and B is slow(ish) when the id set is over a few hundred ids
> >
> > I hope this is a bit better explanation than the one I provided before.
> >
> > Cheers
> > Hannu
> >
> > P.S. I am not sure if this is a limitation of postgres_fdw or Postgres
> > itself.
> >
> > P.P.S. I tested a little with Multicorn, and PostgreSQL did not request
> > row counts for any IN plans, so it may be that the planner does not
> > consider this kind of plan at all. (Testing was on PostgreSQL 9.3.4.)
> >
> > Hannu
> Also, here is a sample run of the two plans to illustrate my point.
>
> How it is run now:
>
> testdb=# explain analyse verbose
> select r.data, l.data
> from onemillion_pgfdw r
> join onemillion l
> on r.id = l.id and l.id between 100000 and 100100;
>
>                                                               QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=111.61..198.40 rows=1 width=16) (actual time=7534.360..8731.541 rows=101 loops=1)
>    Output: r.data, l.data
>    Hash Cond: (r.id = l.id)
>    ->  Foreign Scan on public.onemillion_pgfdw r  (cost=100.00..178.25 rows=2275 width=12) (actual time=1.628..8364.688 rows=1000000 loops=1)
>          Output: r.id, r.inserted, r.data
>          Remote SQL: SELECT id, data FROM public.onemillion
>    ->  Hash  (cost=10.39..10.39 rows=98 width=12) (actual time=0.179..0.179 rows=101 loops=1)
>          Output: l.data, l.id
>          Buckets: 1024  Batches: 1  Memory Usage: 5kB
>          ->  Index Scan using onemillion_pkey on public.onemillion l  (cost=0.42..10.39 rows=98 width=12) (actual time=0.049..0.124 rows=101 loops=1)
>                Output: l.data, l.id
>                Index Cond: ((l.id >= 100000) AND (l.id <= 100100))
>  Total runtime: 8732.213 ms
> (13 rows)
>
> Time: 8733.799 ms
>
>
> And how the above query should be planned/executed:
>
> testdb=# explain analyse verbose
> select r.data, l.data
> from (select * from onemillion_pgfdw where id = any (array(select id
> from onemillion where id between 100000 and 100100))) r
> join onemillion l
> on r.id = l.id;
>
>                                                               QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=110.81..1104.30 rows=111 width=16) (actual time=2.756..3.738 rows=101 loops=1)
>    Output: onemillion_pgfdw.data, l.data
>    InitPlan 1 (returns $0)
>      ->  Index Only Scan using onemillion_pkey on public.onemillion  (cost=0.42..10.39 rows=98 width=4) (actual time=0.055..0.118 rows=101 loops=1)
>            Output: onemillion.id
>            Index Cond: ((onemillion.id >= 100000) AND (onemillion.id <= 100100))
>            Heap Fetches: 101
>    ->  Foreign Scan on public.onemillion_pgfdw  (cost=100.00..163.41 rows=111 width=12) (actual time=2.729..3.012 rows=101 loops=1)
>          Output: onemillion_pgfdw.id, onemillion_pgfdw.inserted, onemillion_pgfdw.data
>          Remote SQL: SELECT id, data FROM public.onemillion WHERE ((id = ANY ($1::integer[])))
>    ->  Index Scan using onemillion_pkey on public.onemillion l  (cost=0.42..8.37 rows=1 width=12) (actual time=0.005..0.006 rows=1 loops=101)
>          Output: l.id, l.inserted, l.data
>          Index Cond: (l.id = onemillion_pgfdw.id)
>  Total runtime: 4.469 ms
> (14 rows)
>
> Time: 6.437 ms
>
>
>
>
> >> postgres=# ALTER SERVER loop OPTIONS (ADD fdw_startup_cost '1000');
> >> ALTER SERVER
> >> postgres=# EXPLAIN VERBOSE SELECT * FROM onemillion_pgsql WHERE id in
> >> (SELECT id FROM onemillion WHERE data > '0.9' LIMIT 100);
> >>                                           QUERY PLAN
> >> -----------------------------------------------------------------------------------------------
> >>  Hash Semi Join  (cost=1023.10..41983.21 rows=100 width=30)
> >>    Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
> >>    Hash Cond: (onemillion_pgsql.id = onemillion.id)
> >>    ->  Foreign Scan on public.onemillion_pgsql  (cost=1000.00..39334.00 rows=1000000 width=29)
> >>          Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
> >>          Remote SQL: SELECT id, inserted, data FROM public.onemillion
> >>    ->  Hash  (cost=21.85..21.85 rows=100 width=4)
> >>          Output: onemillion.id
> >>          ->  Limit  (cost=0.00..20.85 rows=100 width=4)
> >>                Output: onemillion.id
> >>                ->  Seq Scan on public.onemillion  (cost=0.00..20834.00 rows=99918 width=4)
> >>                      Output: onemillion.id
> >>                      Filter: (onemillion.data > '0.9'::text)
> >>  Planning time: 0.690 ms
> >> (14 rows)
> >>
> >> or, as Tom mentioned, by disabling the use_remote_estimate option:
> >>
> >> postgres=# ALTER FOREIGN TABLE onemillion_pgsql OPTIONS (SET
> >> use_remote_estimate 'false');
> >> ALTER FOREIGN TABLE
> >> postgres=# EXPLAIN VERBOSE SELECT * FROM onemillion_pgsql WHERE id in
> >> (SELECT id FROM onemillion WHERE data > '0.9' LIMIT 100);
> >>                                           QUERY PLAN
> >> ----------------------------------------------------------------------------------------------
> >>  Hash Semi Join  (cost=123.10..41083.21 rows=100 width=30)
> >>    Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
> >>    Hash Cond: (onemillion_pgsql.id = onemillion.id)
> >>    ->  Foreign Scan on public.onemillion_pgsql  (cost=100.00..38434.00 rows=1000000 width=30)
> >>          Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
> >>          Remote SQL: SELECT id, inserted, data FROM public.onemillion
> >>    ->  Hash  (cost=21.85..21.85 rows=100 width=4)
> >>          Output: onemillion.id
> >>          ->  Limit  (cost=0.00..20.85 rows=100 width=4)
> >>                Output: onemillion.id
> >>                ->  Seq Scan on public.onemillion  (cost=0.00..20834.00 rows=99918 width=4)
> >>                      Output: onemillion.id
> >>                      Filter: (onemillion.data > '0.9'::text)
> >>  Planning time: 0.215 ms
> >> (14 rows)
> >>
> >> Thanks,
> >>
> >> Best regards,
> >> Etsuro Fujita
> >>
> >>
>
>
>

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
