Re: Abbreviated keys for Numeric

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Abbreviated keys for Numeric
Date: 2015-02-21 22:09:37
Message-ID: CAM3SWZRR+LdpadytaPE9qeNrw34UwUaMhiuCUGH0Fc5e07K_uA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Feb 21, 2015 at 12:15 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Although I cannot easily explain the disparity in performance between
>> 1M and 5M sized sets for this query:
>>
>> select count(distinct randtxt) from stuff_text
>>
>> You did make sure that the queries didn't spill to disk, right? Or
>> that they did so consistently, at least.
>
> All the queries were running with work_mem=1GB, and I don't think they
> were spilling to disk. Actually, I don't have the 'merge' patch applied,
> so that would probably crash because of SIGSEGV.

Actually, that isn't so -- Andrew's datum sort patch incidentally
fixes tuplesort_begin_datum() in the same way as the patch I posted.
You shouldn't actually need my patch, which was just an immediate fix.

I can recreate the problem you see with text sort regressions.
Abbreviation is aborted for the case in question, unsurprisingly, and
fairly far in. With that many tuples, the idea of taking abbreviated
cardinality as a proxy for full cardinality becomes less important,
because either way you have to do at least 10 comparisons per item on
average. Originally, I had as a heuristic that once you get to a
million items without aborting, stop considering the possibility of
aborting - that much should probably be added back, at a minimum.
Andrew is already doing that in his numeric patch (at 100,000 tuples),
and that may be the only reason why numeric is not regressed too.

For a query like this, if I "#define DEBUG_ABBREV_KEYS", so that we
don't actually abort, and run this query of yours until it stabilizes:

select * from (select * from stuff_text order by randtxt offset
100000000000) foo;

It takes about 5.3 seconds. If I don't, however - if it is allowed to
run its course - it takes a whopping 35 seconds. So the ad-hoc cost
model of the text opclass definitely needs more work, as I suspected.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2015-02-21 22:16:47 Re: Abbreviated keys for Numeric
Previous Message Tom Lane 2015-02-21 21:57:58 Re: Expanding the use of FLEXIBLE_ARRAY_MEMBER for declarations like foo[1]