Re: B-Tree support function number 3 (strxfrm() optimization)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thom Brown <thom(at)linux(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)
Date: 2014-06-12 21:09:33
Message-ID: CAM3SWZSq96DKTa4otA5Z71-8PwhNrWEkPFYJKGaArGhaSMdKHQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for looking into this.

On Thu, Jun 12, 2014 at 9:25 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Still, it's fair to say that on this Linux system, the first 8 bytes
> capture a significant portion of the entropy of the first 8 bytes of
> the string, whereas on MacOS X you only get entropy from the first 2
> bytes of the string. It would be interesting to see results from
> other platforms people might care about also.

Right. It was a little bit incautious of me to say that we had the
full benefit of 8 bytes of storage with "en_US.UTF-8", since that is
only true of lower case characters (I think that FreeBSD can play
tricks with this. Sometimes, it will give you the benefit of 8 bytes
of entropy for an 8 byte string, with only non-differentiating
trailing bytes, so that the first 8 bytes of "Aaaaaaaa" are distinct
from the first eight bytes of "aaaaaaaa", while any trailing bytes are
non-distinct for both). In any case it's pretty clear that a goal of
the glibc implementation is to concentrate as much entropy as possible
into the first part of the string, and that's the important point.
This makes perfect sense, and is why I was so incredulous about the
Mac behavior. After all, the Open Group's strcoll() documentation
says:

"The strxfrm() and strcmp() functions should be used for sorting large lists."

Sorting text is hardly an infrequent requirement -- it's almost the
entire reason for having strxfrm() in the standard. You're always
going to want to have each strcmp() find differences as early as
possible.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2014-06-12 21:09:41 Re: lo_create(oid, bytea) breaks every extant release of libpq
Previous Message Noah Misch 2014-06-12 21:02:19 Crash on backend exit w/ OpenLDAP [2.4.24, 2.4.31]