From: | David Rowley <dgrowleyml(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Bykov Ivan <i(dot)bykov(at)modernsys(dot)ru>, Sami Imseih <samimseih(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET |
Date: | 2025-03-12 02:28:27 |
Message-ID: | CAApHDvqxB0HiBhz9bPpdT1qjUG_=wnpAX21Wu6CWh5UsRno9_Q@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, 11 Mar 2025 at 23:28, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > Can you share your thoughts on Ivan's NULL jumble idea?
>
> This is an incomplete solution, I am afraid. The origin of the
> problem comes from the fact that a Node (here, Query) stores two times
> successively equivalent Nodes that are used for separate data, with
> NULL being the difference between both. The problem is not limited to
> NULL, we could as well, for example, finish with a Node that has a
> custom jumbling function and the same issue depending on how it is
> used in a parent Node. Adding the offset from the parent in the
> jumbling is a much stronger insurance policy that the proposed
> solution specific to NULL, because each offset is unique in a single
> Node.
One of us is not understanding the problem correctly. I'm very open
for that to be me, but I'll need a bit more explanation for me to know
that for sure.
As far as I see it, providing we ensure we always add something with a
bit of entropy to the jumble buffer for all possible values that a
jumble field can be, this fixes the issue. The problem only occurs at
the moment because we entirely skip over NULL node fields and we end
up with the same jumble bytes if we effectively move the same List of
nodes between the Query's distinctClause and sortClause fields.
If we add something for NULLs, we'll always add distinctClause before
the sortClause. If the distinctClause is NULL we won't end up with the
same jumble bytes as if the sortClause were NULL, as the order the
NULL byte(s) are added to the buffer changes.
The fragment of the Query struct looks like:
List *windowClause; /* a list of WindowClause's */
List *distinctClause; /* a list of SortGroupClause's */
List *sortClause; /* a list of SortGroupClause's */
Let's assume we have a non-NIL windowClause, a non-NIL distinctClause
and a NIL sortClause. The jumble bytes currently look like:
"<windowClause bytes><distinctClause bytes>", and if we have an ORDER
BY instead of a DISTINCT, we get: "<windowClause bytes><sortClause
bytes>", and this is problematic when the distinctClause bytes and
sortClause bytes at the same as that results in the same hash.
If we account for the NULLs, those two cases become: "<windowClause
bytes><distinctClause bytes><byte for NULL sortClause>" and
"<windowClause bytes><byte for NULL distinct clause><sortClause
bytes>", which is going to be highly unlikely to hash to the same
value due to the buffer not being the same.
Can you explain where my understanding is wrong?
David
From | Date | Subject | |
---|---|---|---|
Next Message | Amit Langote | 2025-03-12 03:07:08 | Re: Question about duplicate JSONTYPE_JSON check |
Previous Message | Noah Misch | 2025-03-12 01:23:15 | Re: Back-patch of: avoid multiple hard links to same WAL file after a crash |