From: | "Fujii(dot)Yuki(at)df(dot)MitsubishiElectric(dot)co(dot)jp" <Fujii(dot)Yuki(at)df(dot)MitsubishiElectric(dot)co(dot)jp> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alexander Pyhalov <a(dot)pyhalov(at)postgrespro(dot)ru>, Bruce Momjian <bruce(at)momjian(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>, "Finnerty, Jim" <jfinnert(at)amazon(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, "Fujii(dot)Yuki(at)df(dot)MitsubishiElectric(dot)co(dot)jp" <Fujii(dot)Yuki(at)df(dot)MitsubishiElectric(dot)co(dot)jp> |
Subject: | RE: Partial aggregates pushdown |
Date: | 2023-12-06 08:41:21 |
Message-ID: | TYAPR01MB551411A7DDFF0B38C5AAEDFE9584A@TYAPR01MB5514.jpnprd01.prod.outlook.com |
Lists: pgsql-hackers
Hi Mr. Haas, hackers.
> From: Robert Haas <robertmhaas(at)gmail(dot)com>
> Sent: Tuesday, November 28, 2023 5:03 AM
> Also, I want to make one other point here about security and reliability. Right now, there is no way for a user to feed
> arbitrary data to a deserialization function. Since serialization and deserialization functions are only used in the context of
> parallel query, we always know that the data fed to the deserialization function must have come from the serialization
> function on the same machine. Nor can users call the deserialization function directly with arbitrary data of their own
> choosing, because users cannot call functions that take or return internal. But with this system, it becomes possible to
> feed arbitrary data to a deserialization function.
> The user could redefine the function on the remote side so that it produces arbitrary data of their choosing, and the local
> deserialization function will ingest it.
>
> That's potentially quite a significant problem. Consider for example that numericvar_deserialize() does no validity
> checking on any of the weight, sign, or dscale, but not all values for those fields are legal. Right now that doesn't matter,
> but if you can feed arbitrary data to that function, then it does. I don't know exactly what the consequences are if you can get
> that function to spit out a NumericVar with values outside the normal legal range. What can you do then?
> Store a bogus numeric on disk? Crash the server? Worst case, some problem like this could be a security issue allowing for
> escalation to superuser; more likely, it would be a crash bug, corrupt your database, or lead to unexpected and strange
> error messages.
>
> Unfortunately, I have the unpleasant suspicion that most internal-type aggregates will be affected by this problem.
> Consider, for example, string_agg_deserialize(). Generally, strings are one of the least-constrained data types, so you
> might hope that this function would be OK. But it doesn't look very promising. The first int4 in the serialized representation
> is the cursor, which would have to be bounds-checked, lest someone provide a cursor that falls outside the bounds of the
> StringInfo and, maybe, cause a reference to an arbitrary memory location. Then the rest of the representation is the actual
> data, which could be anything. This function is used for both bytea and text, and for bytea, letting the payload be
> anything is OK.
> But for text, the supplied data shouldn't contain an embedded zero byte, or otherwise be invalid in the server encoding. If
> it were, that would provide a vector to inject invalidly encoded data into the database. This feature can't be allowed to do
> that.
I completely overlooked this issue. I should have considered the risks of sending raw state values or serialized state
data directly from the remote server to the local one. I apologize.
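To check my understanding of the required hardening, a rough sketch of the checks that, say, string_agg_deserialize()
would need before trusting bytes from a remote server might look like the following. This is only my illustration, not
actual code: the is_bytea flag is invented (the real function serves both types and cannot tell them apart), and the
surrounding declarations of buf and sstate are assumed.

    /* Sketch only: assumes the on-wire form is a 4-byte cursor
     * followed by the payload, as in string_agg_serialize(). */
    int         cursor;
    int         datalen;
    const char *data;

    cursor = pq_getmsgint(&buf, 4);
    datalen = VARSIZE_ANY_EXHDR(sstate) - 4;
    data = pq_getmsgbytes(&buf, datalen);

    /* reject a cursor pointing outside the accumulated data */
    if (cursor < 0 || cursor > datalen)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_BINARY_REPRESENTATION),
                 errmsg("invalid string_agg state: cursor out of range")));

    /* for text (unlike bytea), also reject embedded NUL bytes and
     * sequences that are invalid in the server encoding */
    if (!is_bytea)
        pg_verifymbstr(data, datalen, false);

But as you note below, getting such checks right for every internal-type aggregate is hard, and they would also slow
down ordinary parallel query, so I agree this is not an attractive direction.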
> What could be a solution to this class of problems? One option is to just give up on supporting this feature for internal-type
> aggregates for now. That's easy enough to do, and just means we have less functionality, but it's sad because that's
> functionality we'd like to have. Another approach is to add necessary sanity checks to the relevant deserialization
> functions, but that seems a little hard to get right, and it would slow down parallel query cases which are probably going to
> be more common than the use of this feature. I think the slowdown might be significant, too. A third option is to change
> those aggregates in some way, like giving them a transition function that operates on some data type other than internal,
> but there again we have to be careful of slowdowns. A final option is to rethink the infrastructure in some way, like having
> a way to serialize to something other than bytea, for which we already have input functions with adequate checks. For
> instance, if string_agg_serialize() produced a record containing an integer column and a text or bytea column, we could
> attempt to ingest that record on the other side and presumably the right things would happen in the case of any invalid
> data. But I'm not quite sure what infrastructure would be required to make this kind of idea work.
Thank you very much for providing a direction towards resolving this issue.
As you suggested in your last option, it seems that extending the current aggregate-function infrastructure is the
only viable choice. It may take some time, but I will work out a concrete solution.
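As a starting point, my reading of that last option, sketched for string_agg_serialize(), is something like the
following. This is only my illustration: it assumes a composite type such as (cursor int4, data bytea) registered for
the purpose, and the lookup of its tupdesc is omitted.

    /* Sketch only: return the transition state as a record rather
     * than raw bytea, so the receiving side can funnel each column
     * through the existing, fully-checked input/receive functions. */
    Datum       values[2];
    bool        nulls[2] = {false, false};
    bytea      *payload;
    HeapTuple   tuple;
    int         datalen = state->len;

    payload = (bytea *) palloc(datalen + VARHDRSZ);
    SET_VARSIZE(payload, datalen + VARHDRSZ);
    memcpy(VARDATA(payload), state->data, datalen);

    values[0] = Int32GetDatum(state->cursor);
    values[1] = PointerGetDatum(payload);

    tuple = heap_form_tuple(tupdesc, values, nulls);
    PG_RETURN_DATUM(HeapTupleGetDatum(tuple));

I will investigate what additional infrastructure is needed to carry such a record through the FDW and back into the
combine function on the local side.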
> From: Robert Haas <robertmhaas(at)gmail(dot)com>
> Sent: Tuesday, November 28, 2023 4:08 AM
> On Wed, Nov 22, 2023 at 1:32 AM Alexander Pyhalov <a(dot)pyhalov(at)postgrespro(dot)ru> wrote:
> > Hi. HAVING is also a problem. Consider the following query
> >
> > SELECT count(a) FROM t HAVING count(a) > 10 - we can't push it down to
> > the foreign server, as HAVING needs the full aggregate result, but
> > the foreign server doesn't know it.
>
> I don't see it that way. What we would push to the foreign server would be something like SELECT count(a) FROM t. Then,
> after we get the results back and combine the various partial counts locally, we would locally evaluate the HAVING clause
> afterward. That is, partial aggregation is a barrier to pushing down HAVING clause itself, but it doesn't preclude pushing
> down the aggregation.
I understand what the problem is. I will try to fix it in the next version.
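To restate the example so I am sure I follow: for

    SELECT count(a) FROM t HAVING count(a) > 10;

postgres_fdw would send the remote server only something like

    SELECT count(a) FROM t;

then combine the partial counts locally, and only afterward apply the qual count(a) > 10 to the combined result. So
the HAVING clause itself must not be pushed down, but it does not preclude pushing down the partial aggregation.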
> From: Robert Haas <robertmhaas(at)gmail(dot)com>
> Sent: Tuesday, November 28, 2023 5:03 AM
> On Wed, Nov 22, 2023 at 5:16 AM Fujii(dot)Yuki(at)df(dot)MitsubishiElectric(dot)co(dot)jp
> <Fujii(dot)Yuki(at)df(dot)mitsubishielectric(dot)co(dot)jp> wrote:
> > I did not choose Approach 2 because I was not confident that the
> > disadvantage mentioned in 2.(2)(a) would be accepted by the PostgreSQL development community.
> > If it is accepted, I think Approach 2 is smarter.
> > Could you please provide your opinion on which approach is preferable
> > after comparing these two approaches?
> > If we cannot say anything without comparing the amount of source code,
> > as Mr. Momjian mentioned, we need to estimate the amount of source code required to implement Approach 2.
>
> I've had the same concern, that approach #2 would draw objections, so I think you were right to be cautious about it. I
> don't think it is a wonderful approach in all ways, but I do think that it is superior to approach #1. If we add dedicated
> support to the grammar, it is mostly a one-time effort, and after that, there should not be much need for anyone to be
> concerned about it. If we instead add extra aggregates, then that generates extra work every time someone writes a patch
> that adds a new aggregate to core. I have a difficult time believing that anyone will prefer an approach that involves an
> ongoing maintenance effort of that type over one that doesn't.
>
> One point that seems to me to be of particular importance is that if we add new aggregates, there is a risk that some
> future aggregate might do that incorrectly, so that the main aggregate works, but the secondary aggregate created for this
> feature does not work. That seems like it would be very frustrating for future code authors so I'd like to avoid the risk as
> much as we can.
Is your concern the hassle of, and the potential for human error in, manually adding new partial
aggregation functions, rather than the catalog becoming bloated?
The process of creating partial aggregation functions from existing aggregate functions can be automated,
so I believe that issue can be resolved. However, automating it may make the patch even larger,
so overall, approach #2 might be better.
To implement approach #2, I will need to investigate how much additional code is required.
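For reference, the kind of catalog entry that such automation under approach #1 would have to generate looks roughly
like the sketch below for avg(int4). The avg_p_int4 name is my invention; the transition and combine functions are
the ones avg(int4) already uses.

    -- Sketch only: a partial-aggregate twin of avg(int4) with no final
    -- function, so it returns the raw transition state {count, sum}
    -- for the local server to combine and finalize.
    CREATE AGGREGATE avg_p_int4(int4) (
        sfunc = int4_avg_accum,
        stype = int8[],
        combinefunc = int4_avg_combine,
        initcond = '{0,0}'
    );

One such twin per aggregate, kept in sync forever, is exactly the ongoing maintenance burden you describe (and for
aggregates whose state type is internal, the deserialization problem above comes back), which is another reason
approach #2 looks better to me overall.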
Sincerely yours,
Yuuki Fujii
--
Yuuki Fujii
Information Technology R&D Center Mitsubishi Electric Corporation