From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)2ndquadrant(dot)com>
Cc: "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: Ideas about a better API for postgres_fdw remote estimates
Date: 2020-09-08 21:05:30
Message-ID: 20200908210530.2rle7dzd2itrkhxv@development
Lists: pgsql-hackers
On Tue, Sep 08, 2020 at 05:55:09PM +0530, Ashutosh Bapat wrote:
>On Fri, 4 Sep 2020 at 20:27, Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>>
>>
>> 4) I wonder if we actually want/need to simply output pg_statistic data
>> verbatim like this. Is postgres_fdw actually going to benefit from it? I
>> kinda doubt that, and my assumption was that we'd return only a small
>> subset of the data, needed by get_remote_estimate.
>>
>> This has a couple of issues. Firstly, it requires knowledge of what
>> the stakind constants in pg_statistic mean and how to interpret them -
>> but OK, it's true those do not change very often (or at all). Secondly,
>> it entirely ignores extended statistics - OK, we might extract those
>> too, but it's going to be much more complex. And finally, it entirely
>> ignores costing on the remote node. Surely we can't just apply the local
>> random_page_cost or whatever, because those may be entirely different.
>> And we don't know whether the remote is going to use an index, etc.
>>
>> So is extracting data from pg_statistic the right approach?
>>
>>
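To clarify what I meant by a small subset: the pg_stats view already
decodes the pg_statistic stakind slots into named columns, so such an API
might only need to return something like the following (just a sketch,
table and column names are hypothetical):

    SELECT null_frac, avg_width, n_distinct,
           most_common_vals, most_common_freqs,
           histogram_bounds, correlation
    FROM pg_stats
    WHERE schemaname = 'public'
      AND tablename = 'remote_tab'
      AND attname = 'col2';
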
>There are two different problems, which ultimately might converge.
>1. If use_remote_estimates = false, or more generally if querying costs
>from the foreign server for costing paths is impractical, we want to use
>local estimates and try to come up with costs. For that purpose we keep
>some statistics locally, and the user is expected to refresh them
>periodically by running ANALYZE on the foreign table. This patch is about
>(a) doing this efficiently, without fetching every row from the foreign
>server, and (b) doing it automatically through autovacuum, without the
>user having to fire ANALYZE. I think this also answers your question
>about vacuum_rel() above.
>
>2. How to efficiently extract costs from an EXPLAIN plan when
>use_remote_estimates is true. That's the subject of some nearby thread. I
>think you are referring to that problem here. Hence your next point.
>
I think that was the topic of *this* thread as started by Tom, but I now
realize Andrey steered it in the direction of re-using remote stats. That
seems useful too, but it confused me a bit.
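FWIW the manual workflow for (1) today is roughly this (just a sketch,
the table name is hypothetical):

    -- rely on local stats only, refreshed by hand
    ALTER FOREIGN TABLE remote_tab
        OPTIONS (ADD use_remote_estimate 'false');
    ANALYZE remote_tab;  -- currently fetches every row from the remote
                         -- just to build the local sample

and the patch aims to make that ANALYZE step cheap and automatic.
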
>Using EXPLAIN to get costs from the foreign server isn't efficient. It
>increases planning time a lot; sometimes planning time exceeds execution
>time. As usage of foreign tables becomes more and more common, this isn't
>ideal. I think we should move towards a model in which the optimizer can
>decide whether a subtree involving a foreign server should be evaluated
>locally or on the foreign server without the help of the foreign server.
>One way to do it (I am not saying that this is the only or the best way)
>is to estimate the cost of the foreign query locally, based on the
>information available locally about the foreign server and foreign table.
>This might mean that we have to get that information from the foreign
>server, cache it locally, and use it several times, including the indexes
>on the foreign table, the values of various cost parameters, etc. Though
>this approach doesn't solve all of those problems, it's one step forward,
>and it also makes the current scenario more efficient.
>
True, but that project is way more ambitious than providing a simple API
for postgres_fdw to obtain the estimates more efficiently.
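
For context, with use_remote_estimate enabled postgres_fdw issues an
EXPLAIN on the remote side for each path it considers, and parses the
costs out of the first line of the output. Roughly like this (the output
values here are made up):

    EXPLAIN SELECT col2 FROM public.remote_tab WHERE col2 < 42;
    --  Seq Scan on remote_tab  (cost=0.00..35.50 rows=850 width=4)

One such round trip per considered query shape is what makes the planning
so expensive.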
>I agree that the patch needs some work though, especially the code
>dealing with serialization and deserialization of statistics.
I think there's a bunch of open questions, e.g. what to do with extended
statistics. For example, what should happen when the extended statistics
object is defined only on the local or only on the remote server, or when
the definitions don't match? This is probably not an issue for "regular"
stats, because that format seems pretty stable, but for extended stats
there are differences between versions.
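
For example (hypothetical definitions), with

    -- local server
    CREATE STATISTICS dist_stats (dependencies) ON a, b FROM t;

    -- remote server
    CREATE STATISTICS dist_stats (ndistinct, mcv) ON a, b, c FROM t;

it's not obvious which serialized stats (if any) we could sensibly
import, and e.g. the "mcv" kind does not even exist before PG 12.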
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services