From: | "Joel Jacobson" <joel(at)compiler(dot)org> |
---|---|
To: | "Tomas Vondra" <tomas(dot)vondra(at)enterprisedb(dot)com>, "jian he" <jian(dot)universality(at)gmail(dot)com> |
Cc: | "Tom Dunstan" <pgsql(at)tomd(dot)cc>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Do we want a hashset type? |
Date: | 2023-06-22 05:51:28 |
Message-ID: | d4c36bc8-2c15-40bc-a35f-3489dd8a31dd@app.fastmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Jun 20, 2023, at 18:25, Tomas Vondra wrote:
> On 6/20/23 16:56, Joel Jacobson wrote:
>> The reference to consistency with what we do elsewhere might not be entirely
>> applicable in this context, since the set feature we're designing is a new beast
>> in the SQL landscape.
>
> I don't see how it's new, considering relational algebra is pretty much
> based on (multi)sets, and the three-valued logic with NULL values is
> pretty well established part of that.
What I meant was that the SET-feature is new; since it doesn't exist in PostgreSQL nor SQL.
>> I think adhering to the theoretical purity of sets by excluding NULLs aligns us
>> with set theory, simplifies our code, and parallels set implementations in other
>> languages.
>
> I don't see how that would be more theoretically pure, really. The
> three-valued logic is a well established part of relational algebra, so
> not respecting that is more a violation of the purity.
Hmm, I think it's pure in different ways;
Set Theory is well established and is based on two-values logic,
but at the same time SQL's three-valued logic is also well established.
>> I think we have an opportunity here to innovate and potentially influence a
>> future set concept in the SQL standard.
>
> I doubt this going to influence what the SQL standard says, especially
> because it already defined the behavior for MULTISETS (of which the sets
> are a special case, pretty much). So this has 0% chance of success.
OK. 0% is 1% too low for me to work with. :)
>> However, I see how one could argue against this reasoning, on the basis that
>> PostgreSQL users might be more familiar with and expect NULLs can exist
>> everywhere in all data structures.
>
> Right, it's what we already do for similar cases, and if you have NULLS
> in the data, you better be aware of the behavior. Granted, some people
> are surprised by three-valued logic, but using a different behavior for
> some new features would just increase the confusion.
Good point.
>> I've been trying hard, but I can't find compelling use-cases where a NULL element
>> in a set would offer a more natural SQL query than handling NULLs within SQL and
>> keeping the set NULL-free.
>
> IMO if you have NULL values in the data, you better be aware of it and
> handle the case accordingly (e.g. by filtering them out when building
> the set). If you don't have NULLs in the data, there's no issue.
As long as the data model and queries would ensure there can never be
any NULLs, fine, then there's is no issue.
> And in the graph case, I don't see why you'd have any NULLs, considering
> we're dealing with adjacent nodes, and if there's adjacent node, it's ID
> is not NULL.
Me neither, can't see the need for any NULLs there.
>> Does anyone else have a strong realistic example where including NULLs in the
>> set would simplify the SQL query?
>
> I'm sure there are cases where you have NULLs in the dat aand need to
> filter them out, but that's just natural consequence of having NULLs. If
> you have them you better know what NULLs do ...
What I tried to find was an example for was when you wouldn't want to
filter out the NULLs, when you would want to include the NULL
in the set.
If we could just find one should realistic use-case, that would be very
helpful, since it would then kill my argument completely that we couldn't
do without storing a NULL in the set.
> It's too early to make any strong statements, but it's going to be hard
> to convince me we should handle NULLs differently from what we already
> do elsewhere.
I think it's a trade-off, and I don't have any strong preference for the simplicity
of a classical two-valued set-theoretic system vs a three-valued
multiset-based one. I was 51/49 but given your feedback I'm now 49/51.
I think the next step is to think about how the hashset type should work
with three-valued logic, and then implement it to get a feeling for it.
For instance, how should hashset_count() work?
Given the query,
SELECT hashset_count('{1,2,3,null}'::int4hashset);
Should we,
a) threat NULL as a distinct value and return 4?
b) ignore NULL and return 3?
c) return NULL? (since the presence of NULL can be thought to render the entire count indeterminate)
I think my personal preference is (b) since it is then consistent with how COUNT() works.
/Joel
From | Date | Subject | |
---|---|---|---|
Next Message | Yugo NAGATA | 2023-06-22 05:58:14 | Re: Make pgbench exit on SIGINT more reliably |
Previous Message | Fujii.Yuki@df.MitsubishiElectric.co.jp | 2023-06-22 05:23:33 | RE: Partial aggregates pushdown |