Re: Use extended statistics to estimate (Var op Var) clauses

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use extended statistics to estimate (Var op Var) clauses
Date: 2021-08-11 15:18:44
Message-ID: 633e406e-4e65-59a7-5c8a-153a9ce224a6@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 8/11/21 4:51 PM, Mark Dilger wrote:
>
>
>> On Aug 11, 2021, at 5:08 AM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>>
>> This feels like rather an artificial example though. Is there any real
>> use for this sort of clause?
>
> The test generated random combinations of clauses and then checked if any had consistently worse performance. These came up. I don't know that they represent anything real.
>
> What was not random in the tests was the data in the tables. I've gotten curious if these types of clauses (with columns compared against themselves) would still be bad for random rather than orderly data sets. I'll go check....
>
> testing....
>
> Wow. Randomizing the data makes the problems even more extreme. It seems my original test set was actually playing to this patch's strengths, not its weaknesses. I've changed the columns to double precision and filled the columns with random() data, where column1 gets random()^1, column2 gets random()^2, etc. So on average the larger numbered columns will be smaller, and the mcv list will be irrelevant, since values should not tend to repeat.
>
> Over all queries, 47791 have better estimates after the patch, but 34802 had worse estimates after the patch (with the remaining 17407 queries having roughly equal quality).
>
> The worst estimates are still ones that have a column compared to itself:
>
> better:0, worse:33: A <= B or A <= A or A <= A
> better:0, worse:33: A <= B or A = A or not A <> A
> better:0, worse:33: A <= B or A >= A or not A <> A
> better:0, worse:33: A <> B or A <= A
> better:0, worse:33: A <> B or A <= A or A <> A
> better:0, worse:33: A <> B or A <= A or A >= A
> better:0, worse:33: A <> B or A <= A or not A = A
> better:0, worse:33: A <> B or A > A or not A < A
> better:0, worse:33: A <> B or A >= A
> better:0, worse:33: A <> B or A >= A and A <= A
> better:0, worse:33: A = B or not A > A or not A > A
> better:0, worse:33: A >= B or not A <> A or A = A
> better:0, worse:39: B <= A or B <= B or B <= B
> better:0, worse:39: B <= A or B = B or not B <> B
> better:0, worse:39: B <= A or B >= B or not B <> B
> better:0, worse:39: B <> A or B <= B
> better:0, worse:39: B <> A or B <= B or B <> B
> better:0, worse:39: B <> A or B <= B or B >= B
> better:0, worse:39: B <> A or B <= B or not B = B
> better:0, worse:39: B <> A or B > B or not B < B
> better:0, worse:39: B <> A or B >= B
> better:0, worse:39: B <> A or B >= B and B <= B
> better:0, worse:39: B = A or not B > B or not B > B
> better:0, worse:39: B >= A or not B <> B or B = B
>

The other interesting thing all those clauses have in common is that
they're OR clauses. And we handle that a bit differently. But I think
the "strange" clauses with the same Var on both sides is the main issue,
and not detecting them as "simple" clauses should fix that.

> But there are plenty that got worse without that, such as the following examples:
>
> better:25, worse:39: A < B and A < B or B > A
> better:10, worse:48: A < B and A < C
> better:10, worse:54: A < B and A < C or C > A
>
> I'll go test random data designed to have mcv lists of significance....
>

Hard to say without having a look at the data set, but there'll always
be cases where the extended stats perform a bit worse, due to (a) luck
and (b) the stats covering only small fraction of the table.

But of course, it's worth investigating the suspicious cases.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2021-08-11 15:51:00 Re: Next Steps with Hash Indexes
Previous Message Tom Lane 2021-08-11 15:17:57 Re: Next Steps with Hash Indexes