Re: postgres_fdw bug in 9.6

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgres_fdw bug in 9.6
Date: 2016-12-12 22:45:36
Message-ID: 13242.1481582736@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> I have a test case where I made the fdw connect back to itself, and
> stripped out all the objects that I could and still reproduce the case. It
> is large, 21MB compressed, 163MB uncompressed, so I am linking it here:
> https://drive.google.com/open?id=0Bzqrh1SO9FcEZkpPM0JwUEMzcmc

Thanks for the test case. I believe I understand the fundamental problem,
or one of the fundamental problems --- I'm not exactly convinced there
aren't several here.

The key issue is that GetExistingLocalJoinPath believes it can return
any random join path as what to use for the fdw_outerpath of a foreign
join. You can get away with that, perhaps, as long as you only consider
two-way joins. But in a nest of foreign joins, this fails to consider
the interrelationships of join paths and their children. In particular,
what we have happening in this example is that GetExistingLocalJoinPath
seizes randomly on a MergePath for an upper join relation, clones it,
sees that the outer child is a join ForeignPath, and replaces that outer
child with its fdw_outerpath ... which was chosen equally cavalierly by
some previous execution of GetExistingLocalJoinPath, and does not have
the sort ordering expected by the MergePath. So we'd generate an invalid
EPQ plan, except that the debug cross-check in create_mergejoin_plan
notices the discrepancy.

I believe there are probably more problems here, or at least if there
aren't, it's not clear why not. Because of GetExistingLocalJoinPath's
lack of curiosity about what's underneath the join pathnode it picks,
it seems to me that it's possible for it to return a path tree that
*isn't* all local joins. If we're looking at, say, a hash path for
a 4-way join, whose left input is a hash path for a 3-way join, whose
left input is a 2-way foreign join, what's stopping that from being
returned as a "local" path tree?

Likewise, it seems like the code is trying to reject any custom-path join
types, or at least this barely-intelligible comment seems to imply that:

* Just skip anything else. We don't know if corresponding
* plan would build the output row from whole-row references
* of base relations and execute the EPQ checks.

But this coding fails to notice any such join type that's below the
level of the immediate two join inputs.

We've probably managed to not notice this so far because foreign joins
generally ought to dominate any local join method, so that there wouldn't
often be cases where the surviving paths use local joins for input
sub-joins. But Jeff's test case proves it can happen.

I kind of wonder why this infrastructure exists at all; it's not the way
I'd have foreseen handling EPQ for remote joins. However, while "throw
it away and start again" might be good advice going forward, I suppose
it won't be very popular for applying to 9.6.

One way that we could make things better is to rely on the knowledge
that EPQ isn't asked to evaluate joins for more than one row per input
relation, and therefore using merge or hash join technology is really
overkill. We could make a tree of simple nestloop joins, which aren't
going to care about sort order, if we could identify the correct join
clauses to apply. At least some of that could be laid off on the FDW,
which if it's gotten this far at all, ought to know what join clauses
need to be enforced by the foreign join. So I'm thinking a little bit
in terms of "just collect the foreign scans for all the base rels
included in the join and then build a cross-product nestloop join tree,
applying all the join clauses at the top". This would have the signal
value that it's guaranteed to succeed and so can be left for later,
rather than having to expensively redo it at each level of join planning.

(Hm, that does sound a lot like "throw it away and start again", doesn't
it. But what we've got here is busted enough that I'm not sure there
are good alternatives. Maybe for 9.6 we could just rewrite
GetExistingLocalJoinPath, and limp along doing a lot of redundant
computation during planning.)

BTW, what's "existing" about the result of GetExistingLocalJoinPath?
And for that matter, what's "outer" about the content of fdw_outerpath?
Good luck figuring that out from the documentation of the field, all
zero words of it.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2016-12-12 22:46:35 Re: [OSSTEST PATCH 0/1] PostgreSQL db: Retry on constraint violation
Previous Message Tomas Vondra 2016-12-12 22:39:34 Re: PATCH: two slab-like memory allocators