From: | Francisco Olarte <folarte(at)peoplecall(dot)com> |
---|---|
To: | "Peter J(dot) Holzer" <hjp-pgsql(at)hjp(dot)at> |
Cc: | pgsql-general(at)lists(dot)postgresql(dot)org |
Subject: | Re: String comparison problem in select - too many results |
Date: | 2018-01-14 12:20:05 |
Message-ID: | CA+bJJbxnMkZvsXbo44a-+Pt+i6cTX5xhQYdemx+JrWRFJ4DO4A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On Sun, Jan 14, 2018 at 12:14 PM, Peter J. Holzer <hjp-pgsql(at)hjp(dot)at> wrote:
> On 2018-01-12 11:08:39 +0100, Francisco Olarte wrote:
>> C collation is like sorting raw bytes, it doesn't event sort
>> upper/lower case correctly
>
> Now you are falling into the same trap as Durumdara, calling an
> unintended sort order "not correct" ;-).
Well, had you quoted / read a little farther:
>>
C collation is like sorting raw bytes, it doesn't event sort
upper/lower case correctly ( Do not know how you do it in HU, but in
ES we sort aA before bB, while C locale normally sorts AB..ab.. It's
what non-locale aware programs use because it's dirt cheap to
implement, just sort the data as unsigned byte arrays
lexicographically. And in fact, IIRC, utf-8 is dessigned in a way that
this works too ( You can sort by raw bytes and you get the data sorted
lexicographically sorted by code points ).
<<
You'll see I was trying to define "correctly" somehow. English is not
my native language and it seems I obviously failed at it, I'll try to
do it better next time.
> C collation is certainly not what a "normal user" expects. It is
> therefore wrong for many applications (e.g., you couldn't use it to
> sort a telephone book), but it might be correct for others (e.g., it you
> need a stable, unambiguous sort order but don't care much about the
> order itself).
stable / unambiguous is shared by a la lot of collation methods, but I
see what you try to say.
> (By coincidence, I also stumbled over the fact that the en_US.UTF-8
> collation ignores punctuation characters and spaces at least in the
> first pass - took me a minute or so to figure out why I got an
> "inconsistent" (read: unexpected and not obvious for me) sort order.)
Nearly all the locale-aware sorts do funny things with
punctuation/spaces. I've found sort more and more surprissing since I
started ( with the electromechanical IBM card sorters, those )
> Lexicographers are interested in sorting single words. Therefore they
> aren't interested in punctuation. They may also not be very interested
> in accents, because most languages have few words which differ only by
> an accent
We have loads of them in spanish, but they are normally easy and many
of them come from verbs conjugation which does not o into the
dictionary ...
> (and it the dictionary is printed on paper, the user will
> easily be able to scan up and down a few centimeters and find the right
> entry without knowing whether "à" is sorted before or after "á").
and we have none of this, I only know French doing it.
> Somebody preparing an address list should care about punctuation: You
> would probably not expect to find "Smith-Jones, Amanda" between "Smith,
> John" and "Smith, Julia". And you probably want to sort "23 Main Street"
> before "180 Main Street".
> Which brings us back to Durumdara's example: I don't know his
> application, so I don't know what "normal users" of his application
> would expect, but to me those values look like two numbers separated by
> a slash. So I would expect '23/4567' to be sorted between '18/0212' and
> '180/2010', but of course the C collation doesn't do that:
He does not seem to want this. As all his examples use the same prefix
I think he just want to extract a small range of keys with a common
prefix . I've had this problems when "augmenting" a part-number code
to be product-part, and using things like "between xxx- and xxx-zz" to
get all parts for product xxx ( it was changed to dual fields at the
next iteration, but sometimes you need these things for the interim ).
I mean, the fact that they are both numbers doesn't mean he wants /
need numerical ordering on them, for many purposes just collapsing
prefixes is enough.
>
> => select * from foo order by t collate "C";
> ╔══════════╗
> ║ t ║
> ╟──────────╢
> ║ 18/0113 ║
> ║ 18/0212 ║
> ║ 180/2010 ║
> ║ 23/4567 ║
> ╚══════════╝
> (4 rows)
>
> It might be possible to define a custom collation for that, but in a
> case like this I would either split this field into two integer fields
> or use a function-based index.
Yep, but he may have a temporary problem. C collation puts all the
"prefixes" together, which normally is good enough.
Francisco Olarte.
From | Date | Subject | |
---|---|---|---|
Next Message | Francisco Olarte | 2018-01-14 12:30:29 | Re: Is ORDER BY in sub-query preserved when outer query is only projection? |
Previous Message | Andreas Joseph Krogh | 2018-01-14 12:03:08 | Is ORDER BY in sub-query preserved when outer query is only projection? |