Re: Do we want a hashset type?

From: jian he <jian(dot)universality(at)gmail(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Tom Dunstan <pgsql(at)tomd(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Do we want a hashset type?
Date: 2023-06-29 06:54:44
Message-ID: CACJufxFxpMYEkMpoRrYP7YMTp5TkyzWq7piYE5gYB6sCr2mW1g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jun 28, 2023 at 4:50 PM Joel Jacobson <joel(at)compiler(dot)org> wrote:
>
> On Wed, Jun 28, 2023, at 08:26, jian he wrote:
>
> > Hi there.
> > I changed the function hashset_contains to strict.
>
> Changing hashset_contains to STRICT would cause it to return NULL
> if any of the operands are NULL, which I don't believe is correct, since:
>
> SELECT NULL = ANY('{}'::int4[]);
> ?column?
> ----------
> f
> (1 row)
>
> Hence, `hashset_contains('{}'::int4hashset, NULL)` should also return FALSE,
> to mimic the semantics of arrays and MULTISET's MEMBER OF predicate in SQL:2023.
>
> Did you try running `make installcheck` after your change?
> You would then have seen one of the tests failing:
>
> test array-and-multiset-semantics ... FAILED 21 ms
>
> Check the content of `regression.diffs` to see why:
>
> % cat regression.diffs
> diff -U3 /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out /Users/joel/src/hashset/results/array-and-multiset-semantics.out
> --- /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out 2023-06-27 10:07:38
> +++ /Users/joel/src/hashset/results/array-and-multiset-semantics.out 2023-06-28 10:13:27
> @@ -158,7 +158,7 @@
> | | {NULL} | {NULL} | |
> | 1 | {1} | {1} | |
> | 4 | {4} | {4} | |
> - {} | | {NULL} | {NULL} | f | f
> + {} | | {NULL} | {NULL} | | f
> {} | 1 | {1} | {1} | f | f
> {} | 4 | {4} | {4} | f | f
> {NULL} | | {NULL} | {NULL,NULL} | |
> @@ -284,7 +284,8 @@
> "= ANY(...)";
> arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...)
> ------+------+-------------+--------------+------------------+------------
> -(0 rows)
> + {} | | {NULL} | {NULL} | | f
> +(1 row)
>
>
> > also change the way to return an empty array.
>
> Nice.
> I agree the `Datum d` variable was unnecessary.
> I also removed the unused includes.
>
> > in benchmark.sql, would it be ok to use EXPLAIN to demonstrate that
> > int4hashset can speed distinct aggregate and distinct counts?
> > like the following:
> >
> > explain(analyze, costs off, timing off, buffers)
> > SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3
> >
> > explain(analyze, costs off, timing off, buffers)
> > SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3
>
> The 100k tables seems to be too small to give any meaningful results,
> when trying to measure individual queries:
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT array_agg(DISTINCT i) FROM benchmark_input_100k;
> Execution Time: 26.790 ms
> Execution Time: 30.616 ms
> Execution Time: 33.253 ms
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT hashset_agg(i) FROM benchmark_input_100k;
> Execution Time: 32.797 ms
> Execution Time: 27.605 ms
> Execution Time: 26.228 ms
>
> If we instead try the 10M tables, it looks like array_agg(DISTINCT ...)
> is actually faster for the `i` column where all input integers are unique:
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT array_agg(DISTINCT i) FROM benchmark_input_10M;
> Execution Time: 799.017 ms
> Execution Time: 796.008 ms
> Execution Time: 799.121 ms
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT hashset_agg(i) FROM benchmark_input_10M;
> Execution Time: 1204.873 ms
> Execution Time: 1221.822 ms
> Execution Time: 1216.340 ms
>
> For random integers, hashset is a win though:
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT array_agg(DISTINCT rnd) FROM benchmark_input_10M;
> Execution Time: 1874.722 ms
> Execution Time: 1878.760 ms
> Execution Time: 1861.640 ms
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT hashset_agg(rnd) FROM benchmark_input_10M;
> Execution Time: 1253.709 ms
> Execution Time: 1222.651 ms
> Execution Time: 1237.849 ms
>
> > explain(costs off,timing off, analyze,buffers)
> > select count(distinct rnd) from benchmark_input_100k \watch c=3
> >
> > explain(costs off,timing off, analyze,buffers)
> > SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM
> > benchmark_input_100k) sub(x) \watch c=3
>
> I tried these with 10M:
>
> EXPLAIN(costs off,timing off, analyze,buffers)
> SELECT COUNT(DISTINCT rnd) FROM benchmark_input_10M;
> Execution Time: 1733.320 ms
> Execution Time: 1725.214 ms
> Execution Time: 1716.636 ms
>
> EXPLAIN(costs off,timing off, analyze,buffers)
> SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM benchmark_input_10M) sub(x);
> Execution Time: 1249.612 ms
> Execution Time: 1240.558 ms
> Execution Time: 1252.103 ms
>
> Not sure what I think of the current benchmark suite.
>
> I think it would be better to only include some realistic examples from
> real-life, such as the graph query which was the reason I personally started
> working on this. Otherwise there is a risk we optimise for some hypothetical
> scenario that is not relevant in practise.
>
> Would be good with more examples of typical work loads for when the hashset
> type would be useful.
>
> /Joel

> Did you try running `make installcheck` after your change?

First I use make installcheck
PG_CONFIG=/home/jian/postgres/2023_05_25_beta5421/bin/pg_config
I found out it uses another active cluster.
So I killed another active cluster.
later i found another database port so I took me sometime to found out
I need use following:
make installcheck
PG_CONFIG=/home/jian/postgres/2023_05_25_beta5421/bin/pg_config
PGPORT=5421

Anyway, this time, I added another macro,which seems to simplify the code.

#define SET_DATA_PTR(a) \
(((char *) (a->data)) + CEIL_DIV(a->capacity, 8))

it passed all the tests on my local machine.
I should have only made a patch, but when I was committed, I forgot to
mention one file, so I needed 2 commits.

> Not sure what I think of the current benchmark suite.
your result is so different from mine. I use the default config. I
see a big difference. yech, I agree, the performance test should be
more careful.

Attachment Content-Type Size
0002-marco-SET_DATA_PTR-to-quicly-access-hashset-data-reg.patch text/x-patch 3.4 KB
0001-marco-SET_DATA_PTR-to-quicly-access-hashset-data-reg.patch text/x-patch 2.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2023-06-29 07:06:35 Re: Add more sanity checks around callers of changeDependencyFor()
Previous Message Richard Guo 2023-06-29 06:44:43 Re: Assert !bms_overlap(joinrel->relids, required_outer)