From: | Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | t-ishii(at)sra(dot)co(dot)jp, Goran Thyni <goran(at)kirra(dot)net>, PostgreSQL-development <hackers(at)postgreSQL(dot)org> |
Subject: | Re: [HACKERS] indexable and locale |
Date: | 1999-10-19 00:55:17 |
Message-ID: | 199910190055.JAA16894@ext16.sra.co.jp |
Lists: | pgsql-hackers |
> Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp> writes:
> >> Attached is a patch to the old problem discussed feverishly before 6.5.
>
> > ... I think your patches break
> > non-ascii multi-byte character sets data and should be surrounded by
> > #ifdef LOCALE rather than replacing current codes surrounded by
> > #ifndef LOCALE.
>
> I am worried about this patch too. Under MULTIBYTE could it
> generate invalid characters?
I assume you are talking about the following code fragment in the patches:
prefix[prefixlen]++;
This would not generate invalid characters under MULTIBYTE, because it skips
multi-byte characters with this test:
if ((unsigned) prefix[prefixlen] < 126)
This would not make non-ASCII multi-byte characters indexable,
however.
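To make the behaviour of that guard concrete, here is a minimal sketch. It is not the code from the patch: the function name increment_last_ascii_byte is hypothetical, it casts through unsigned char rather than unsigned, and it assumes every byte of a multi-byte character has its high bit set, as in the EUC encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not the patch itself): turn `prefix` into an upper
 * bound for a range scan by incrementing its last byte.
 *
 * Returns 1 if the byte was incremented, 0 if the prefix was left untouched
 * because it is empty or its last byte is 126 or higher. Bytes with the high
 * bit set (assumed to belong to multi-byte characters) are therefore never
 * modified.
 */
static int
increment_last_ascii_byte(char *prefix)
{
    size_t        len;
    unsigned char last;

    assert(prefix != NULL);

    len = strlen(prefix);
    if (len == 0)
        return 0;

    last = (unsigned char) prefix[len - 1];
    if (last >= 126)
        return 0;               /* '~', DEL, or a high-bit byte: skip */

    prefix[len - 1] = (char) (last + 1);
    return 1;
}
```

With this sketch, "abc" becomes "abd", while a prefix ending in a byte such as 0xa4 is returned unchanged, which is why such prefixes get no upper bound and stay non-indexable.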
> Also, do all non-ASCII locales sort
> codes 0-126 in the same order as ASCII? I didn't think they do,
> but I'm not an expert.
As far as I know, they do. At least, all encodings that MULTIBYTE mode can
handle use the same code points as ASCII in the 0-126 range. They have the
following characteristics:
o Code points 0x00-0x7f are compatible with ASCII.
o Code points 0x80 and above are variable-length multi-byte characters. For
example, ISO-8859-1 (German, French, etc.) always has a multi-byte
length of 1, while EUC_JP (Japanese) has a length of 2 to 3.
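As an illustration of these two properties for EUC_JP, the following sketch derives a character's byte length from its first byte. The function names are hypothetical and this is not the backend's own multi-byte code. It relies on the EUC_JP layout in which 0x8f (SS3) introduces a 3-byte character and every other high-bit lead byte starts a 2-byte character.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: byte length of the EUC_JP character starting at s.
 * The input is assumed to be valid EUC_JP; no validation is attempted.
 */
static int
euc_jp_char_len(const unsigned char *s)
{
    assert(s != NULL);

    if (*s < 0x80)
        return 1;               /* 0x00-0x7f: ASCII-compatible */
    if (*s == 0x8f)
        return 3;               /* SS3 prefix: 3-byte character */
    return 2;                   /* SS2 prefix or 2-byte kanji/kana */
}

/*
 * Hypothetical helper: count the characters in a NUL-terminated EUC_JP
 * string. Counting stops early if the last character is truncated.
 */
static size_t
euc_jp_count_chars(const unsigned char *s)
{
    size_t count = 0;

    assert(s != NULL);

    while (*s != '\0')
    {
        int len = euc_jp_char_len(s);
        int i;

        /* Do not step past the terminating NUL on a truncated character. */
        for (i = 1; i < len; i++)
            if (s[i] == '\0')
                return count;

        s += len;
        count++;
    }
    return count;
}
```

For a single-byte encoding such as ISO-8859-1 the equivalent length function would simply return 1 for every byte.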
> The approach I was considering for fixing the problem was to use a
> loop that would repeatedly try to generate a string greater than the
> prefix string. The basic loop step would increment the rightmost
> byte as Goran has done (or, if it's already up to the limit, chop
> it off and increment the next character position). Then test to
> see whether the '<' operator actually believes the result is
> greater than the given prefix, and repeat if not. This avoids making
> any strong assumptions about the sort order of different character
> codes. However, there are two significant issues that would have
> to be surmounted to make it work reliably:
Sounds like a good idea.
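The loop described in the quoted proposal could look roughly like the sketch below. Everything here is an assumption made for illustration: the function name make_greater_string is hypothetical, strcoll() stands in for the '<' operator, the limit for a byte is taken to be 0xff, and the prefix is limited to 255 bytes. It works purely on bytes and does nothing to keep multi-byte characters valid.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch: rewrite `buf` (initially holding the prefix) into a
 * string that strcoll() reports as greater than the original prefix.
 *
 * Returns 1 on success. Returns 0 if no such string was found, in which
 * case `buf` has been chopped down and must not be used.
 */
static int
make_greater_string(char *buf)
{
    char   orig[256];
    size_t len;

    assert(buf != NULL);

    len = strlen(buf);
    if (len >= sizeof(orig))
        return 0;               /* sketch only handles short prefixes */
    memcpy(orig, buf, len + 1);

    while (len > 0)
    {
        unsigned char last = (unsigned char) buf[len - 1];

        /* Increment the rightmost byte until the comparison agrees. */
        while (last < 0xff)
        {
            last++;
            buf[len - 1] = (char) last;
            if (strcoll(buf, orig) > 0)
                return 1;
        }

        /* Byte is at its limit: chop it off and retry one position left. */
        buf[--len] = '\0';
    }
    return 0;
}
```

In the default "C" locale, where strcoll() compares bytes, "abc" yields "abd" on the first step, and a prefix whose last byte is 0xff falls back to incrementing the byte before it.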
> 1. In MULTIBYTE mode incrementing the rightmost byte might yield
> an illegal multibyte character. Some way to prevent or detect this
> would be needed, lest it confuse the comparison operator. I think
> we have some multibyte routines that could be used to check for
> a valid result, but I haven't looked into it.
I don't think this is an issue as long as locale isn't enabled. For
multibyte encodings (Japanese, Chinese, etc.), locale is totally
useless, and I usually don't enable it.
> 2. I think there are some locales out there that have context-
> sensitive sorting rules, ie, a given character string may sort
> differently than you'd expect from considering the characters in
> isolation. For example, in German isn't "ss" treated specially?
> If "pqrss" does not sort between "pqrs" and "pqrt" then the entire
> premise of *both* sides of the LIKE optimization falls apart,
> because you can't be sure what will happen when comparing a prefix
> string like "pqrs" against longer strings from the database.
> I do not know if this is really a problem, nor what we could do
> to avoid it if it is.
I'm not sure about it, but I am afraid it could be a problem. I think
the real solution would be to support the standard CREATE COLLATION.
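The premise in question can be stated as a small check. This is only a sketch: the function name sorts_within is hypothetical, and strcoll() is used as a stand-in for the database's comparison operator under whatever locale is currently active.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical check: does `s` sort in the half-open range [lower, upper)
 * according to strcoll() in the current locale? The LIKE optimization
 * assumes this holds for every string that starts with the prefix `lower`.
 */
static int
sorts_within(const char *lower, const char *s, const char *upper)
{
    assert(lower != NULL && s != NULL && upper != NULL);

    return strcoll(lower, s) <= 0 && strcoll(s, upper) < 0;
}
```

In the "C" locale, sorts_within("pqrs", "pqrss", "pqrt") returns 1 because the comparison is byte-wise. Whether a particular locale with context-sensitive rules returns 0 for some such triple is exactly the open question raised above, and this sketch does not answer it.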
---
Tatsuo Ishii