Re: Duplicate Values or Not?!

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Greg Stark <gsstark(at)MIT(dot)EDU>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, John Seberg <johnseberg(at)yahoo(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Duplicate Values or Not?!
Date: 2005-09-17 15:50:44
Message-ID: 87fys3r8vf.fsf@stark.xeocode.com
Lists: pgsql-general


Greg Stark <gsstark(at)MIT(dot)EDU> writes:

> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
> > If that does change the results, it indicates you've got strings which
> > are bytewise different but compare equal according to strcoll(). We've
> > seen this and other misbehaviors from some locale definitions when faced
> > with data that is invalid per the encoding the locale expects.
>
> There are plenty of non-bytewise-identical strings that do legitimately
> compare equal in various locales. Does the hash code hash strxfrm or the
> original bytes?

Hm. Some experimentation shows that, at least with glibc's locale definitions,
the strings that I thought compared equal don't actually compare equal.
Capitalization, punctuation, and white space are basically ignored in general
in non-C locales, but they do seem to make strings compare non-equal when
they're the only differentiating factor.
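
One way to repeat this kind of experiment is sketched below in Python. The
helper name find_collation_ties and the sample strings are made up for
illustration; locale.strcoll is the standard-library wrapper around the C
library's collation, and the result depends entirely on which locale is active
and installed.

```python
import itertools
import locale


def find_collation_ties(strings, coll=locale.strcoll):
    """Return pairs of distinct strings that the collation reports as equal.

    `coll` is any strcoll-style function returning <0, 0 or >0. It defaults
    to locale.strcoll, which uses whatever LC_COLLATE is currently set.
    """
    ties = []
    for a, b in itertools.combinations(strings, 2):
        # Only pairs that differ as strings but collate as equal are of interest.
        if a != b and coll(a, b) == 0:
            ties.append((a, b))
    return ties


# Hypothetical usage; the locale name is an assumption and may not be installed:
# locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
# print(find_collation_ties(["abc", "ABC", "a bc", "a-bc"]))
```

An empty list for strings that differ only in case, punctuation or spacing
would match the glibc behaviour described above; a non-empty list would mean
the locale really does treat some different strings as equal.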

Is this guaranteed by any spec? Or is counting on this behaviour unsafe?

If it's legal for strcoll to compare as equal two byte-wise different strings
then the hash function really ought to be calling strxfrm before hashing or
else it will be inconsistent. It doesn't seem to be doing so currently.
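
To make the consistency requirement concrete, here is a minimal sketch in
Python, not the actual Postgres hash code. The function name collation_hash is
hypothetical; locale.strxfrm is the standard-library wrapper for strxfrm. The
only point is that two strings which strcoll treats as equal have equal
transformed keys, so hashing the key keeps hashing and comparison in agreement.

```python
import locale
from typing import Callable


def collation_hash(s: str, xfrm: Callable[[str], str] = locale.strxfrm) -> int:
    """Hash the collation key of `s` rather than its raw contents.

    `xfrm` is any strxfrm-style function; it defaults to locale.strxfrm,
    which follows the currently active LC_COLLATE setting.
    """
    # Strings that collate as equal share the same key, hence the same hash.
    return hash(xfrm(s))


# Illustration with a stand-in transform (str.casefold) that plays the role
# of a locale in which case differences are ignored entirely:
# collation_hash("Abc", str.casefold) == collation_hash("aBC", str.casefold)
```

The cost is one transformation per hashed value, which is the overhead
weighed against the tie-breaking alternative.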

I find it interesting that Perl has faced this same dilemma and chose to
override the locale definition in this case. If the locale definition
compares two strings as equal, then Perl does a bytewise comparison and uses
that to break ties. This guarantees that non-bytewise-identical strings never
compare equal. I suspect they did it for a similar reason too, namely keeping
the semantics in sync with Perl hashes.
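
A rough sketch of that tie-breaking rule, written in Python for brevity; it is
neither Perl's nor Postgres's actual code. The name compare_with_tiebreak is
hypothetical, and comparing the UTF-8 encodings stands in for a memcmp over
the stored bytes.

```python
import locale
from typing import Callable


def compare_with_tiebreak(
    a: str, b: str, coll: Callable[[str, str], int] = locale.strcoll
) -> int:
    """Compare by collation first, then bytewise if the collation says equal.

    Returns a negative, zero or positive value. Zero is returned only when
    the two strings are byte-for-byte identical.
    """
    result = coll(a, b)
    if result != 0:
        return result
    # Collation tie: fall back to a plain byte comparison (the memcmp step).
    ab, bb = a.encode("utf-8"), b.encode("utf-8")
    return (ab > bb) - (ab < bb)
```

With this rule, equality under the comparison coincides with byte equality, so
a hash over the raw bytes stays consistent with it, and the extra byte
comparison only runs when the collation reports a tie.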

Postgres could follow that model. I think it would solve any inconsistencies
just fine and not cause problems. However, the difference would be visible to
users, and it might be considered a bug if the locale really does claim the
strings are equal but Postgres doesn't agree. On the other hand, I think it
would perform better than a lot of extra calls to strxfrm, since the extra
memcmp would only rarely kick in.

--
greg
