From: | "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de> |
---|---|
To: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Calculage avg. width when operator = is missing |
Date: | 2015-09-22 20:05:55 |
Message-ID: | CACACo5T=D=0Kiok=Ok53G3ijQTcY82kfSuDEqjL9antMqnWn9Q@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Sep 22, 2015 8:58 PM, "Andrew Dunstan" <andrew(at)dunslane(dot)net> wrote:
>
>
>
> On 09/22/2015 12:16 PM, Shulgin, Oleksandr wrote:
>>
>> Hi Hackers,
>>
>> I've recently stumbled upon a problem with table bloat estimation in
case there are columns of type JSON.
>>
>> The quick bloat estimation queries use sum over pg_statistic.stawidth of
table's columns, but in case of JSON the corresponding entry is never
created by the ANALYZE command due to equality comparison operator
missing. I understand why there is no such operator defined for this
particular type, but shouldn't we still try to produce meaningful average
width estimation?
>>
>> In my case the actual bloat is around 40% as verified with pgstattuple,
while the bloat reported by quick estimate can be between 75% and 95%(!) in
three instances of this problem. We're talking about some hundreds of GB
of miscalculation.
>>
>> Attached patch against master makes the std_typanalyze still try to
compute the minimal stats even if there is no "=" operator. Makes sense?
>>
>> I could also find this report in archives that talks about similar
problem, but due to all values being over the analyze threshold:
>>
>>
http://www.postgresql.org/message-id/flat/12480(dot)1389370514(at)sss(dot)pgh(dot)pa(dot)us#12480(dot)1389370514@sss.pgh.pa.us
>>
>> I think we could try harder, otherwise any estimate relying on average
width can be way off in such cases.
>
> Yes, "/revenons/ à /nos moutons/." You can set up text based comparison
ops fairly easily for json - you just need to be aware of the limitations.
See https://gist.github.com/adunstan/32ad224d7499d2603708
Yes, I've already tried this approach and have found that analyze
performance degrades an order of magnitude due to sql-level function
overhead and casts to text. In my tests, from 200ms to 2000ms with btree
ops on a default sample of 30,000 rows.
Should have mentioned that.
There is a very hacky way to substitute bttextcmp for the sort support
function after defining the opclass by updating pg_amproc, buy I would
rather avoid that. :-)
--
Alex
From | Date | Subject | |
---|---|---|---|
Next Message | Christopher Browne | 2015-09-22 20:17:08 | Re: [COMMITTERS] pgsql: Use gender-neutral language in documentation |
Previous Message | Joshua D. Drake | 2015-09-22 19:57:05 | Re: [COMMITTERS] pgsql: Use gender-neutral language in documentation |