From: | Dennis Gearon <gearond(at)cvc(dot)net> |
---|---|
To: | Dennis Björklund <db(at)zigo(dot)dhs(dot)org> |
Cc: | Maksim Likharev <mlikharev(at)aurigin(dot)com>, pgsql-general(at)postgresql(dot)org |
Subject: | Re: Sorting Problem |
Date: | 2003-08-13 16:09:00 |
Message-ID: | 3F3A629C.5090307@cvc.net |
Lists: | pgsql-general |
Dennis Björklund wrote:
> In the future we need indexes that depend on the locale (and a lot of other changes).
>
I agree. I've been reading about this subject on the web a lot lately. I am **NOT** a Microsoft fan, but SQL Server even lets a user define a language (maybe encoding) down to the column level!
I've also been reading up on GNU C and on languages, encodings, and localization:
http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-1.html
http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,1222,00.html
There are three basic approaches to handling different languages in computerized text:
A/ Various adaptations of the 8-bit character set, i.e. the ISO-8859-x series.
One byte per character.
Easy storing: small size for a string.
Easy storing: for English characters, 100% efficient use of storage space.
Easy processing between applications, works well in the stream model of *nix.
Easy processing in applications, a byte is a character.
Easy string handling, NO NULL bytes in a string, except at the end of the string.
NOT easy to tell the encoding from the document itself.
This is not the way of the future.
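To illustrate that last drawback, here is a minimal Python sketch (the sample byte is my own pick, not something from the papers above). The same byte decodes to a different character under each ISO-8859-x variant, and nothing in the data itself says which one was meant:

```python
# One raw byte, no metadata: the reader has to already know the charset.
raw = bytes([0xE4])

# Decode the identical byte under three members of the ISO-8859-x series.
charsets = ("iso-8859-1", "iso-8859-5", "iso-8859-7")
decoded = {name: raw.decode(name) for name in charsets}

for name, ch in decoded.items():
    print(name, repr(ch), hex(ord(ch)))
```

Each charset gives a valid but different letter (Latin, Cyrillic, Greek), so a sort order built for one of them is meaningless for the others.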
B/ wide characters
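UCS-2/UTF-16, UCS-4/UTF-32, others
Each character the same width, 2 or 4 bytes (2 bytes handle nearly all text in common use; UTF-16 needs a 4-byte surrogate pair for characters outside the Basic Multilingual Plane).
Not so easy storing: for English characters, 50% to 75% of the storage space is wasted.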
Difficult processing between applications, does NOT work well in the stream model of *nix
Easy processing in applications, a set width of bits/bytes is a character.
Difficult string handling, MANY NULL bytes in a string, especially if in English.
Moderately easy to tell encoding/language in the document.
********This should be how PostgreSQL stores data internally.********
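The storage and NULL-byte points for wide characters can be checked with a short Python sketch (the sample word is arbitrary; the little-endian codec names are used only to keep a byte order mark out of the counts):

```python
text = "Sorting"  # plain English sample, 7 characters

narrow = text.encode("iso-8859-1")  # 1 byte per character
wide16 = text.encode("utf-16-le")   # 2 bytes per character for this text
wide32 = text.encode("utf-32-le")   # 4 bytes per character

# Fraction of the wide encodings that is pure padding for English text.
waste16 = 1 - len(narrow) / len(wide16)
waste32 = 1 - len(narrow) / len(wide32)

# Zero bytes embedded in the data, which C-style string functions
# would treat as end of string.
nulls16 = wide16.count(0)
nulls32 = wide32.count(0)

print(len(narrow), len(wide16), len(wide32))
print(waste16, waste32, nulls16, nulls32)
```

For this sample the waste comes out at 50% and 75%, and every character carries one or three zero bytes, which is why such data does not pass cleanly through byte-oriented *nix tools.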
C/ Multibyte characters
UTF-8
Variable width for different characters, 1 to 4 bytes.
Not so easy storing: for non-English characters, 50% to 80% loss of storage space
(in reality, most common Western languages hover around 5-20% loss of storage space;
most common non-Western languages hover around 40-60% loss of storage space).
Easy processing between applications, works well in the stream model of *nix
Difficult processing in applications, a variable number of bytes is a character.
Easy string handling, only ONE NULL byte in a string (the terminator).
Moderately easy to tell encoding/language in the document.
********This is how PostgreSQL should default to sending data OUT of the application,
i.e. to the display, the web, or other system applications.********
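A rough Python sketch of the variable-width behavior (the sample characters are my own choice; the current UTF-8 definition, RFC 3629, uses at most 4 bytes per character):

```python
# Sample characters from different ranges: ASCII, Latin-1, the euro sign,
# a CJK ideograph, and a musical symbol outside the 16-bit range.
samples = ["a", "\u00f6", "\u20ac", "\u65e5", "\U0001d11e"]

# Bytes needed per character in UTF-8.
widths = [len(ch.encode("utf-8")) for ch in samples]
print(widths)

# A name with one non-ASCII letter: 9 characters, 10 bytes, no zero bytes.
name = "Bj\u00f6rklund"
encoded = name.encode("utf-8")
print(len(name), len(encoded), encoded.count(0))

# ASCII text is byte-for-byte identical in UTF-8, so it streams unchanged.
ascii_same = "Sorting".encode("utf-8") == "Sorting".encode("ascii")
print(ascii_same)
```

The character count and the byte count differ as soon as one non-ASCII letter appears, which is the "difficult processing in applications" cost, while the absence of embedded zero bytes is what keeps it friendly to pipes and C strings.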