Quick Links

Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog

From:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog
Date:	2022-05-25 09:55:40
Message-ID:	76f80a24-fe89-c6f9-5e79-239a02e8a2b7@enterprisedb.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 5/25/22 00:16, Bruce Momjian wrote:
> On Mon, May 16, 2022 at 12:09:41AM +0200, Tomas Vondra wrote:
>> I think it's an interesting idea. In principle it allows deducing the
>> multi-column MCV for arbitrary combination of columns, not determined in
>> advance. We'd have the MCV with HLL instead of frequencies for columns
>> A, B and C:
>>
>> (a1, hll(a1))
>> (a2, hll(a2))
>> (...)
>> (aK, hll(aK))
>>
>>
>> (b1, hll(b1))
>> (b2, hll(b2))
>> (...)
>> (bL, hll(bL))
>>
>> (c1, hll(c1))
>> (c2, hll(c2))
>> (...)
>> (cM, hll(cM))
>>
>> and from this we'd be able to build MCV for any combination of those
>> three columns.
>
> Sorry, but I am lost here. I read about HLL here:
>
> https://towardsdatascience.com/hyperloglog-a-simple-but-powerful-algorithm-for-data-scientists-aed50fe47869
>
> However, I don't see how they can be combined for multiple columns.

It's the same as combining multiple HLL filters. HLL is essentially just
an array of counters, and to calculate a union (i.e. HLL for elements in
at least one of the input HLL sketches), you can just do Max() of the
counters. For intersection, you have to use inclusion-exclusion
principle, i.e.

intersection(HLL1, HLL2)
= estimate(HLL1) + estimate(HLL2) - estimate(union(HLL1,HLL2))

which is exactly the same as

P(A & B) = P(A) + P(B) - P(A | B)

There's more in:

https://github.com/citusdata/postgresql-hll

https://agkn.wordpress.com/2012/12/17/hll-intersections-2/

which also mentions the weakness - error is proportional to the size of
the union, while the intersection may be much smaller. Which might be an
issue, especially when combining multiple columns.

> Above, I know A,B,C are columns, but what is a1, a2, etc?

a1 is a value in column A, common enough to make it into the MCV. But
instead of just a frequency, we store a HLL tracking unique rows (by
adding CTID to the HLL).

So for example for a "city" column, you'd have

("NY", HLL of CTIDs for rows with city = NY)
("Philadephia", HLL of CTIDs for rows with city = Philadelphia)
...

etc.

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog at 2022-05-24 22:16:43 from Bruce Momjian

Responses

Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog at 2022-05-26 01:19:35 from Bruce Momjian

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tomas Vondra	2022-05-25 10:13:45	Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)
Previous Message	Ranier Vilela	2022-05-25 09:22:24	Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)