Re: Do we want a hashset type?

From: jian he <jian(dot)universality(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Joel Jacobson <joel(at)compiler(dot)org>, Tom Dunstan <pgsql(at)tomd(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Do we want a hashset type?
Date: 2023-06-20 18:08:48
Message-ID: CACJufxFR3yT_Qf0J43hs05+XLts1M2kBARexcf9X85A4GJcyOw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jun 21, 2023 at 12:25 AM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
>
>
> On 6/20/23 16:56, Joel Jacobson wrote:
> > On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
> >> On 6/20/23 12:59, Joel Jacobson wrote:
> >>> On Mon, Jun 19, 2023, at 02:00, jian he wrote:
> >>>> select hashset_contains('{1,2}'::int4hashset,NULL::int);
> >>>> should return null?
> >>>
> >>> I agree, it should.
> >>>
> >>> I've now changed all functions except int4hashset() (the init function)
> >>> and the aggregate functions to be STRICT.
> >>
> >> I don't think this is correct / consistent with what we do elsewhere.
> >> IMHO it's perfectly fine to have a hashset containing a NULL value,
> >
> > The reference to consistency with what we do elsewhere might not be entirely
> > applicable in this context, since the set feature we're designing is a new beast
> > in the SQL landscape.
> >
>
> I don't see how it's new, considering relational algebra is pretty much
> based on (multi)sets, and the three-valued logic with NULL values is
> pretty well established part of that.
>
> > I think adhering to the theoretical purity of sets by excluding NULLs aligns us
> > with set theory, simplifies our code, and parallels set implementations in other
> > languages.
> >
>
> I don't see how that would be more theoretically pure, really. The
> three-valued logic is a well established part of relational algebra, so
> not respecting that is more a violation of the purity.
>
> > I think we have an opportunity here to innovate and potentially influence a
> > future set concept in the SQL standard.
> >
>
> I doubt this going to influence what the SQL standard says, especially
> because it already defined the behavior for MULTISETS (of which the sets
> are a special case, pretty much). So this has 0% chance of success.
>
> > However, I see how one could argue against this reasoning, on the basis that
> > PostgreSQL users might be more familiar with and expect NULLs can exist
> > everywhere in all data structures.
> >
>
> Right, it's what we already do for similar cases, and if you have NULLS
> in the data, you better be aware of the behavior. Granted, some people
> are surprised by three-valued logic, but using a different behavior for
> some new features would just increase the confusion.
>
> > A different perspective is to look at what use-cases we can foresee.
> >
> > I've been trying hard, but I can't find compelling use-cases where a NULL element
> > in a set would offer a more natural SQL query than handling NULLs within SQL and
> > keeping the set NULL-free.
> >
>
> IMO if you have NULL values in the data, you better be aware of it and
> handle the case accordingly (e.g. by filtering them out when building
> the set). If you don't have NULLs in the data, there's no issue.
>
> And in the graph case, I don't see why you'd have any NULLs, considering
> we're dealing with adjacent nodes, and if there's adjacent node, it's ID
> is not NULL.
>
> > Does anyone else have a strong realistic example where including NULLs in the
> > set would simplify the SQL query?
> >
>
> I'm sure there are cases where you have NULLs in the dat aand need to
> filter them out, but that's just natural consequence of having NULLs. If
> you have them you better know what NULLs do ...
>
> It's too early to make any strong statements, but it's going to be hard
> to convince me we should handle NULLs differently from what we already
> do elsewhere.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

> http://www.wiscorp.com/sqlmultisets.zip

> Conceptually, a multiset is an unordered collection of elements, all of the same type, with dupli-
> cates permitted. Unlike arrays, a multiset is an unbounded collection, with no declared maximum
> cardinality. This does not mean that the user can insert elements in a multiset without limit, just
> that the standard does not mandate that there should be a limit. This is analogous to tables, which
> have no declared maximum number of rows.

Postgres arrays don't have size limits.
unordered means no need to use subscript?
So multiset is a more limited array type?

null is fine. but personally I feel like so far the hashset main
feature is the quickly aggregate unique value using hashset.
I found using hashset count distinct (non null values) is quite faster.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2023-06-20 18:43:05 Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX
Previous Message Nathan Bossart 2023-06-20 17:56:34 Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX