Re: Strange avg value size for index on expression in pg_stats

From: Jehan-Guillaume de Rorthais <ioguix(at)free(dot)fr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Strange avg value size for index on expression in pg_stats
Date: 2014-11-10 15:41:11
Message-ID: 20141110164111.665de42c@erg
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Mon, 10 Nov 2014 09:39:23 -0500
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Jehan-Guillaume de Rorthais <ioguix(at)free(dot)fr> writes:
> > While investigating about a wrong result with my btree bloat estimation
> > query, I found a strange stat deviation between the average size of a value
> > in its table and its average size in one index on the "lower" expression.
>
> ANALYZE doesn't look at the physical index at all. For an expression
> index, it will recompute the index expression at each sampled row, and
> then take statistics on those values. In the case you've got here, it's
> re-running the lower() function and looking at the uncompressed result.
> So that accounts for the discrepancy in average width stats.

Ok, understood.

But in my demo scenario, I used *only* md5 to populate the test table. So
data length is always 32 and the average width length is exact. No possible
deviation, even with lower(). To be quite dense:

postgres(at)test=# select
length(lower(md5('a'))) = length(md5('b')),
length(md5('c')) = length(md5('d'));
-[ RECORD 1 ]
?column? | t
?column? | t

And here is another test with a static string for all rows:

postgres(at)test=# create table test as
select '1234567890123456789012'::text as t
from generate_series(1,100);
SELECT 100

postgres(at)test=# create index ON test (lower(t));
CREATE INDEX

postgres(at)test=# analyze test;
ANALYZE

postgres(at)test=# select tablename, attname, avg_width
from pg_stats
where schemaname = 'public';
tablename | attname | avg_width
----------------+---------+-----------
test | t | 23
test_lower_idx | lower | 26

AFAIU, we should not have a discrepancy here.

> > This tiny difference is the source of a very bad estimation with the
> > Btree bloat estimation query when values are around an alignement
> > boundary.
>
> TBH, if that query is relying on ANALYZE width estimates to be accurate
> to the last byte, its reliability is going to be horrid anyway.

Well, I'm aware of that. I don't need an accuracy to the last byte. This query
doesn't even pay attention to the data alignment padding anyway (I measured some
10% deviation in a case because of this).

This request only helps guessing the bloat evolution in Btrees or quickly
discover *big* deviations. In many situations, we can not afford a call to
pgstattuple.avg_leaf_density().

But this statistic difference between two values with the exact same size is
itching me. Sa far, I couldn't find a logical explanation and it just looks
like a wrong statistic.

Regards,
--
Jehan-Guillaume (ioguix) de Rorthais

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Edson Richter 2014-11-10 18:45:31 Centos 7 - anyone?
Previous Message Tom Lane 2014-11-10 14:39:23 Re: Strange avg value size for index on expression in pg_stats