FDW join vs full join push down

From: Marc Olivé <marc(dot)olive(at)iomed(dot)es>
To: pgsql-performance(at)postgresql(dot)org
Subject: FDW join vs full join push down
Date: 2021-09-07 14:02:03
Message-ID: CAB7_X5xDKQe0ZwHw4cQ=2aFkjs7R0qHoyvvZigK4E_QBsQffpg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello all,

We have some PG servers which we merge into a "coordinator" node using FDW
and partitioned tables, we partition them by a synthetic "shard_id" field.
There are around 30 tables coordinated this way, with all foreign servers
having the same schema structure.

We have some performance issues when joining foreign tables, always done by
the same "shard_id", where the major bottleneck is how rows from joined
tables are fetched. explain(verbose) shows:

Remote SQL: SELECT entity_id, execution_id, shard_id FROM entity_execution
WHERE ((shard_id = 5)) AND (($1::bigint = entity_id))

This way, PG is doing a lot of round trips between the coordinator and the
foreign nodes, fetching a single row every time, and we have a very high
latency between the coordinator and the nodes.

As the joins are done on the same node, it could send the whole query and
fetch all results in a single round trip.

The FDW are configured with 'use_remote_estimate' to true and we have the
parameters enable_partition_pruning, enable_partitionwise_aggregate and
enable_partitionwise_join activated.
The tables involved can have from a million rows to more than 1000
millions, but the queries usually return a few thousand rows.

A full sample plan and it's query: https://explain.depesz.com/s/TbJy
explain(verbose)
select *
from nlp.note_entity_label nel
join nlp.note_entity ne on ne.note_entity_id = nel.note_entity_id and
ne.shard_id = nel.shard_id
join nlp.note_entity_execution nex on nex.note_entity_id =
ne.note_entity_id and nex.shard_id = nel.shard_id
where
nel.label_id = 192
and nel.shard_id = 5

The row estimates are quite off the true ones, even though we have run
'analyze' on the remote nodes before, and 'use_remote_estimate' is on.
The above query ends in about 6 minutes.

The interesting part is that if we change the 'join' by 'full joins', with
some extra filter, the plan is the one we believe is the optimal one, and
indeed the query ends in 1 second: https://explain.depesz.com/s/b3As

explain(verbose)
with ents as(
select nel.note_entity_id nelid, ne.note_entity_id neid,
nex.note_entity_id nexid, *
from nlp.note_entity_label nel
full join nlp.note_entity ne on ne.note_entity_id = nel.note_entity_id
and ne.shard_id = nel.shard_id
full join nlp.note_entity_execution nex on nex.note_entity_id =
ne.note_entity_id and nex.shard_id = nel.shard_id
where
nel.label_id = 192
and nel.shard_id = 5
)
select *
from ents
where nelid is not null
and neid is not null
and nexid is not null
;

Here we can see that the whole query is sent to the fdw and it finishes in
a reasonable time.

So, the question is if we can do something to make the fdw send the whole
query to the remote nodes when the involved joins use the same partition,
or why isn't PG sending it when we use 'inner join'.
We have tried tweaking the "fdw_tuple_cost" , increasing and lowering it to
unreasonable values
10, 1000, 100000 and 1000000 without the desired result.

Thanks,

Browse pgsql-performance by date

  From Date Subject
Next Message Stepan Yankevych 2021-09-07 18:05:42 Foreign table as partition - Non optimal aggregation plan
Previous Message Andrew Dunstan 2021-09-02 16:59:51 Re: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4