Strange avg value size for index on expression in pg_stats

From: Jehan-Guillaume de Rorthais <ioguix(at)free(dot)fr>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Strange avg value size for index on expression in pg_stats
Date: 2014-11-10 10:52:54
Message-ID: 20141110115254.2997aaf8@erg
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello,

I'm not sure if I should post here or on pgsql-hackers.

While investigating about a wrong result with my btree bloat estimation query,
I found a strange stat deviation between the average size of a value in its
table and its average size in one index on the "lower" expression.

Take the following scenario:

postgres(at)test=# create table test as
test-# select md5(t::text) AS t
test-# from generate_series(1, 1000000) t;
SELECT 1000000

postgres(at)test=# create index ON test (lower(t));
CREATE INDEX

postgres(at)test=# analyze test;
ANALYZE

postgres(at)test=# select tablename, attname, avg_width
test-# from pg_stats
test-# where schemaname = 'public';
tablename | attname | avg_width
----------------+---------+-----------
test | t | 33
test_lower_idx | lower | 36

Md5 values are always 32 bytes long. In the table, pg_stats reports 33 because
of the text header. In the index, the reported value is 36!

Looking at the page layout documentation and in the index using hexdump, I can
not find any answer about this 3 bytes. PFA the "hexdump -C" output from the
index. For each row, we clearly see a 8 bytes row header followed by a ONE byte
value header (43 'C'), the 32 bytes of the md5 and 7 bytes of padding (00).

A wild guess would be that ANALYZE is considering a text field from an
expression has always a 4 bytes header whatever its actual size (text field
header size is one if the value is < 127 bytes long, 4 in other situations).

This tiny difference is the source of a very bad estimation with the Btree bloat
estimation query when values are around an alignement boundary. As instance,
here is the use case that lead me to this:

tablename | attname | avg_width
--------------------+---------+-----------
customer | email | 23
customer_lower_idx | lower | 26

We have an index on email, and another one on lower(index). The first one is
aligned on 24, the second one on 32. Leading to bloat estimation of 17% for
the first one and -1% for the second one (the estimated index size is bigger
than the real one).

Any information about this from a hacker? Do anyone have an explanation about
this? Is it something that's worth posting on pgsql-hackers?

Regards,

Attachment Content-Type Size
index-lower-hexdump-C.txt text/plain 25.8 KB

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Jehan-Guillaume de Rorthais 2014-11-10 11:00:45 Re: Strange avg value size for index on expression in pg_stats
Previous Message Ilya Ashchepkov 2014-11-10 09:07:48 justify_interval: days in year