From: | Peter Geoghegan <peter(at)2ndquadrant(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Greg Stark <stark(at)mit(dot)edu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: sortsupport for text |
Date: | 2012-06-17 22:58:05 |
Message-ID: | CAEYLb_WJMs2oK6rPNjBckQ-tg6mQO4pc_dwTMvfqFZQNGytxGA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 17 June 2012 21:26, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Sure, and in general we only expect that "=" operators mean equivalency;
> a concrete example is float8 "=", which on IEEE-spec machines will say
> that zero and minus zero are equal.
Right; the spec says that, and we punt to the spec. No one sensible
thinks that minus zero and zero floats are not equal, by virtue of the
fact that if you compare them in C using ==, it evaluates to true.
Equivalency as I use the term can't really be talked about without
talking about sorting or less than operators.
> The trick for hashing such datatypes is to be able to guarantee that
> "equal" values hash to the same hash code, which is typically possible
> as long as you know the equality rules well enough. We could possibly
> do that for text with pure-strcoll equality if we knew all the details
> of what strcoll would consider "equal", but we do not.
I see. I am tentatively suggesting that we don't change the definition
of equal, but allow that equivalent text values may not be equal.
> See also citext for an example of a datatype where we can manage to
> treat distinct textual values as equal and still hash them.
I'm not talking about textually distinct values being
equal/equivalent. That doesn't quite capture it (although I suppose a
corollary of what I am trying to express is that equivalent though
non-equal values should invariably have different textual
representations in Postgres).
I think a good example that illustrates the difference between
equivalency and equality is the German ß character (esszet), which is
regarded as equivalent to 'ss' (two 's' characters). The following
German words are sorted correctly as required by de_DE.UTF-8:
Arg
Ärgerlich
Arm
Assistant
Aßlar
Assoziation
So if you take the word "Aßlar" here - that is equivalent to "Asslar",
and so strcoll("Aßlar", "Asslar") will return 0 if you have the right
LC_COLLATE (if you tried this out for yourself and found that I was
actually lying through my teeth, pretend I said Hungarian instead of
German and "some really obscure character" rather than ß). It seems a
bit unsatisfactory to me that a unique constraint will happily accept
these two equivalent strings because they're not bitwise identical,
when the collation was supposed to have that normalisation baked in.
At the same time, I find it intuitively obvious that "Aßlar" ==
"Asslar" is false, and at the very least I think that very few people
would not grant that it is a not unreasonable state of affairs for
both of those two things to hold, at least on the face of it
(presuming that I wasn't actually lying about the particulars of this,
which I was, since this is apparently confined to less popular locales
and I'm too lazy to research the exact details).
Now, I do realise that there is what might appear to be a tension in
what I'm saying; it is precisely the fact that we can traverse a btree
index using comparators that allows a btree to satisfy an equality
condition (or an equivalency condition; however you choose to
characterise whatever it is that the '=' operator does).
To restate the problem: The '=' operator implies equivalency and not
equality. Or does it? Look at this code from nbtutils.c's
_bt_checkkeys() function:
test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
datum, key->sk_argument);
if (!DatumGetBool(test))
{
/*
* Tuple fails this qual. If it's a required qual for the current
* scan direction, then we can conclude no further tuples will
* pass, either.
*
* Note: because we stop the scan as soon as any required equality
* qual fails, it is critical that equality quals be used for the
* initial positioning in _bt_first() when they are available. See
* comments in _bt_first().
*/
***SNIP***
The qual is verified for the index tuple itself (the test return value
lets us know if it matches) - the equality operator is actually
called, and is actually re-verified via texteq(). So what appears to
happen is that the btree code finds equivalent tuples, and then,
knowing that all pairs of equal tuples are equivalent (but not
necessarily the inverse) checks that it actually has a tuple that's
equal/satisfies the qual. Makes sense, and by this standard I'd judge
that '=' was actually an equality operator that sometimes took
advantage of equivalency purely as an implementation detail, but I'm
not familiar enough with that part of the code to have any degree of
confidence that I haven't made a leap here that I shouldn't have.
ISTM if '=' was really a mere equivalency operator, we'd only every
check (a < b && b < a) in the btree code.
It occurs to me that we might also have a new equivalency operator
(=== or something) so that we actually could answer questions like
"show me the tuples with the word Aßlar in some non-unique column -
both spellings will do". Maybe that's a bit off the wall, I'm not
sure.
Simple question: if you were to just remove the strcmp tie-breaker for
strcoll() in varstr_cmp(), but not touch anything else, would Postgres
exhibit objectively incorrect behaviour?
Incidentally, I think it's interesting that the SQL-exposed interface
to all of this merely presents the user with a boolean:
postgres=# select '4'::text < '4'::text;
?column?
----------
f
(1 row)
This is pretty much what C++ does; operator< and friends
conventionally return a bool, not an int. This might have something to
do with the need to make a break from the qsort() convention of
returning an int that may indicate equality. So the formal definition
of equivalency there may be that two equivalent values will always be
either less than each other or not less than each other - you have no
business expecting them to come out of a sort in any particular order
relative to each other (except with a stable sort). Which is not the
same as equal or "textually distinct but equal", which I think is how
I'd classify your IEEE 754 example.
We can decree that equivalency implies equality, or make all this
internal (which, perversely, I suppose the C++ committee people
cannot).
So, I may have lost sight of why I starting on about equivalency,
which is that it sure would be nice if we could use strxfrm to prepare
strings for sorting, which looks to be a fairly significant win. Does
anyone have any simpler ideas, assuming this one is no good?
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
From | Date | Subject | |
---|---|---|---|
Next Message | Andres Freund | 2012-06-17 23:06:24 | Re: Testing 9.2 in ~production environment |
Previous Message | James Cloos | 2012-06-17 22:51:51 | Testing 9.2 in ~production environment |