From: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
---|---|
To: | Bruce Momjian <bruce(at)momjian(dot)us> |
Cc: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog |
Date: | 2022-05-25 09:55:40 |
Message-ID: | 76f80a24-fe89-c6f9-5e79-239a02e8a2b7@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 5/25/22 00:16, Bruce Momjian wrote:
> On Mon, May 16, 2022 at 12:09:41AM +0200, Tomas Vondra wrote:
>> I think it's an interesting idea. In principle it allows deducing the
>> multi-column MCV for arbitrary combination of columns, not determined in
>> advance. We'd have the MCV with HLL instead of frequencies for columns
>> A, B and C:
>>
>> (a1, hll(a1))
>> (a2, hll(a2))
>> (...)
>> (aK, hll(aK))
>>
>>
>> (b1, hll(b1))
>> (b2, hll(b2))
>> (...)
>> (bL, hll(bL))
>>
>> (c1, hll(c1))
>> (c2, hll(c2))
>> (...)
>> (cM, hll(cM))
>>
>> and from this we'd be able to build MCV for any combination of those
>> three columns.
>
> Sorry, but I am lost here. I read about HLL here:
>
> https://towardsdatascience.com/hyperloglog-a-simple-but-powerful-algorithm-for-data-scientists-aed50fe47869
>
> However, I don't see how they can be combined for multiple columns.
It's the same as combining multiple HLL filters. HLL is essentially just
an array of counters, and to calculate a union (i.e. HLL for elements in
at least one of the input HLL sketches), you can just do Max() of the
counters. For intersection, you have to use inclusion-exclusion
principle, i.e.
intersection(HLL1, HLL2)
= estimate(HLL1) + estimate(HLL2) - estimate(union(HLL1,HLL2))
which is exactly the same as
P(A & B) = P(A) + P(B) - P(A | B)
There's more in:
https://github.com/citusdata/postgresql-hll
https://agkn.wordpress.com/2012/12/17/hll-intersections-2/
which also mentions the weakness - error is proportional to the size of
the union, while the intersection may be much smaller. Which might be an
issue, especially when combining multiple columns.
> Above, I know A,B,C are columns, but what is a1, a2, etc?
a1 is a value in column A, common enough to make it into the MCV. But
instead of just a frequency, we store a HLL tracking unique rows (by
adding CTID to the HLL).
So for example for a "city" column, you'd have
("NY", HLL of CTIDs for rows with city = NY)
("Philadephia", HLL of CTIDs for rows with city = Philadelphia)
...
etc.
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From | Date | Subject | |
---|---|---|---|
Next Message | Tomas Vondra | 2022-05-25 10:13:45 | Re: Improving connection scalability (src/backend/storage/ipc/procarray.c) |
Previous Message | Ranier Vilela | 2022-05-25 09:22:24 | Re: Improving connection scalability (src/backend/storage/ipc/procarray.c) |