From: | Jehan-Guillaume de Rorthais <ioguix(at)free(dot)fr> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Strange avg value size for index on expression in pg_stats |
Date: | 2014-11-10 15:41:11 |
Message-ID: | 20141110164111.665de42c@erg |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On Mon, 10 Nov 2014 09:39:23 -0500
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Jehan-Guillaume de Rorthais <ioguix(at)free(dot)fr> writes:
> > While investigating about a wrong result with my btree bloat estimation
> > query, I found a strange stat deviation between the average size of a value
> > in its table and its average size in one index on the "lower" expression.
>
> ANALYZE doesn't look at the physical index at all. For an expression
> index, it will recompute the index expression at each sampled row, and
> then take statistics on those values. In the case you've got here, it's
> re-running the lower() function and looking at the uncompressed result.
> So that accounts for the discrepancy in average width stats.
Ok, understood.
But in my demo scenario, I used *only* md5 to populate the test table. So
data length is always 32 and the average width length is exact. No possible
deviation, even with lower(). To be quite dense:
postgres(at)test=# select
length(lower(md5('a'))) = length(md5('b')),
length(md5('c')) = length(md5('d'));
-[ RECORD 1 ]
?column? | t
?column? | t
And here is another test with a static string for all rows:
postgres(at)test=# create table test as
select '1234567890123456789012'::text as t
from generate_series(1,100);
SELECT 100
postgres(at)test=# create index ON test (lower(t));
CREATE INDEX
postgres(at)test=# analyze test;
ANALYZE
postgres(at)test=# select tablename, attname, avg_width
from pg_stats
where schemaname = 'public';
tablename | attname | avg_width
----------------+---------+-----------
test | t | 23
test_lower_idx | lower | 26
AFAIU, we should not have a discrepancy here.
> > This tiny difference is the source of a very bad estimation with the
> > Btree bloat estimation query when values are around an alignement
> > boundary.
>
> TBH, if that query is relying on ANALYZE width estimates to be accurate
> to the last byte, its reliability is going to be horrid anyway.
Well, I'm aware of that. I don't need an accuracy to the last byte. This query
doesn't even pay attention to the data alignment padding anyway (I measured some
10% deviation in a case because of this).
This request only helps guessing the bloat evolution in Btrees or quickly
discover *big* deviations. In many situations, we can not afford a call to
pgstattuple.avg_leaf_density().
But this statistic difference between two values with the exact same size is
itching me. Sa far, I couldn't find a logical explanation and it just looks
like a wrong statistic.
Regards,
--
Jehan-Guillaume (ioguix) de Rorthais
From | Date | Subject | |
---|---|---|---|
Next Message | Edson Richter | 2014-11-10 18:45:31 | Centos 7 - anyone? |
Previous Message | Tom Lane | 2014-11-10 14:39:23 | Re: Strange avg value size for index on expression in pg_stats |