From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: postgres_fdw: using TABLESAMPLE to collect remote sample |
Date: | 2022-07-16 21:57:44 |
Message-ID: | 3998908.1658008664@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2022-02-23 00:51:24 +0100, Tomas Vondra wrote:
>> And here's the slightly simplified patch, without the part adding the
>> unnecessary GetServerVersion() function.
> Doesn't apply anymore: http://cfbot.cputube.org/patch_37_3535.log
> Marked as waiting-on-author.
Here's a rebased version that should at least pass regression tests.
I've not reviewed it in any detail, but:
* I'm not really on board with defaulting to SYSTEM sample method,
and definitely not on board with not allowing any other choice.
We don't know enough about the situation in a remote table to be
confident that potentially-nonrandom sampling is OK. So personally
I'd default to BERNOULLI, which is more reliable and seems plenty fast
enough given your upthread results. It could be an idea to extend the
sample option to be like "sample [ = methodname ]", if you want more
flexibility, but I'd be happy leaving that for another time.
* The code depending on reltuples is broken in recent server versions,
and might give useless results in older ones too (if reltuples =
relpages = 0). Ideally we'd retrieve all of reltuples, relpages, and
pg_relation_size(rel), and do the same calculation the planner does.
Not sure if pg_relation_size() exists far enough back though.
* Copying-and-pasting all of deparseAnalyzeSql (twice!) seems pretty
bletcherous. Why not call that and then add a WHERE clause to its
result, or just add some parameters to it so it can do that itself?
* More attention to updating relevant comments would be appropriate,
eg here you've not bothered to fix the adjacent falsified comment:
/* We've retrieved all living tuples from foreign server. */
- *totalrows = astate.samplerows;
+ if (do_sample)
+ *totalrows = reltuples;
+ else
+ *totalrows = astate.samplerows;
* Needs docs obviously. I'm not sure if the existing regression
testing covers the new code adequately or if we need more cases.
Having said that much, I'm going to leave it in Waiting on Author
state.
regards, tom lane
Attachment | Content-Type | Size |
---|---|---|
0001-postgres_fdw-sample-data-on-remote-node-for-20220716.patch | text/x-diff | 13.6 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | David G. Johnston | 2022-07-16 23:50:28 | Re: Move Section 9.27.7 (Data Object Management Functions) to System Information Chapter |
Previous Message | Thomas Munro | 2022-07-16 21:56:27 | Re: Proposal to introduce a shuffle function to intarray extension |