From: Huchev <hugochevrain(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: custom hash-based COUNT(DISTINCT) aggregate - unexpectedly high memory consumption
Date: 2013-10-11 11:42:44
Message-ID: 1381491764000-5774264.post@n5.nabble.com
Lists: pgsql-hackers
gettimeofday(&start, NULL);
for (i = 0; i < VALUES; i++) {
state = XXH32_init(result);
XXH32_update(state, &i, 4);
XXH32_digest(state);
}
gettimeofday(&end, NULL);
This code is using the "update" variant, which is only useful when dealing
with very large amounts of data that can't fit into a single block of
memory. This is obviously overkill for a 4-byte test: three function
calls, a malloc, intermediate bookkeeping, and so on.
To hash a single block of data, it's better to use the simpler (and faster)
one-shot variant, XXH32():
gettimeofday(&start, NULL);
for (i = 0; i < VALUES; i++) { XXH32(&i, 4, result); }
gettimeofday(&end, NULL);
You'll probably see results improve by an order of magnitude. For even
better results, you could inline it (yes, for such a short loop with almost
no work per iteration, inlining makes a very noticeable difference).
That being said, it's true that these advanced hash algorithms only shine
with "big enough" amounts of data to hash. Hashing a 4-byte value into a
4-byte hash is a rather limited exercise; there is no "pigeonhole" issue. A
simple multiplication by a 32-bit prime would do well enough and produce
zero collisions.