From: | Hannu Krosing <hannu(at)trust(dot)ee> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Bruce Momjian <maillist(at)candle(dot)pha(dot)pa(dot)us>, Daniel Kalchev <daniel(at)digsys(dot)bg>, Hiroshi Inoue <Inoue(at)tpf(dot)co(dot)jp>, pgsql-hackers(at)postgreSQL(dot)org |
Subject: | Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem |
Date: | 1999-06-09 18:32:03 |
Message-ID: | 375EB323.799FCD63@trust.ee |
Lists: | pgsql-hackers |
Tom Lane wrote:
>
> Bruce Momjian <maillist(at)candle(dot)pha(dot)pa(dot)us> writes:
> > This certainly explains it. With locale enabled, LIKE does not use
> > indexes because we can't figure out how to do the indexing trick with
> > non-ASCII character sets because we can't figure out the maximum
> > character value for a particular encoding.
>
> We don't actually need the *maximum* character value, what we need is
> to be able to generate a *slightly larger* character value.
>
> For example, what the parser is doing now:
> fld LIKE 'abc%' ==> fld <= 'abc\377'
> is not even really right in ASCII locale, because it will reject a
> data value like 'abc\377x'.
>
> I think what we really want is to generate the "next value of the
> same length" and use a < comparison. In ASCII locale this means
> fld LIKE 'abc%' ==> fld < 'abd'
> which is reliable regardless of what comes after abc in the data.
> The trick is to figure out a "next" value without assuming a lot
> about the local character set and collation sequence.
In single-byte locales it should be easy:
1. Sort a char[256] array holding the values 0-255 using the current
   locale settings. Do it once, either at startup or when first needed.
2. Use binary search on that sorted array to locate the last char
   before the %:
     if (it is not the last char in the sorted array)
       then (replace that char with the one at index+1)
     else
       if (it is not the first char in the LIKE string)
         then (discard the last char and goto 2)
         else (don't do the end restriction)
Locales whose chars already sort in byte order (ASCII, CYRILLIC) could
get special treatment and skip the table.
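The steps above can be sketched in C roughly as follows. This is only an
illustration, not code from the Postgres tree: like_upper_bound,
build_table and sorted_chars are made-up names. It also assumes that
comparing one-char strings with strcoll() gives the same order the locale
uses inside longer strings, which is not guaranteed everywhere. Byte 0 is
left out of the table because it terminates C strings, and chars that
collate equal to the current one are skipped so that the result is
strictly larger.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NCHARS 255                      /* bytes 1..255; 0 is the terminator */

static unsigned char sorted_chars[NCHARS];
static int table_ready = 0;

/* Compare two single bytes as one-char strings under the current locale. */
static int collate_bytes(unsigned char a, unsigned char b)
{
    char sa[2] = { (char) a, '\0' };
    char sb[2] = { (char) b, '\0' };

    return strcoll(sa, sb);
}

/* qsort/bsearch comparator: collation order, byte value as tie-break. */
static int cmp_bytes(const void *pa, const void *pb)
{
    unsigned char a = *(const unsigned char *) pa;
    unsigned char b = *(const unsigned char *) pb;
    int r = collate_bytes(a, b);

    return (r != 0) ? r : (int) a - (int) b;
}

/* Step 1: build the sorted table once, when first needed. */
static void build_table(void)
{
    int i;

    for (i = 0; i < NCHARS; i++)
        sorted_chars[i] = (unsigned char) (i + 1);
    qsort(sorted_chars, NCHARS, 1, cmp_bytes);
    table_ready = 1;
}

/*
 * Step 2: compute an exclusive upper bound for "fld LIKE 'prefix%'".
 * "out" must have room for strlen(prefix) + 1 bytes.
 * Returns 1 if a bound was written to "out", or 0 if there is none
 * (the caller then applies no end restriction).
 */
int like_upper_bound(const char *prefix, char *out)
{
    size_t len = strlen(prefix);

    if (!table_ready)
        build_table();
    memcpy(out, prefix, len + 1);

    while (len > 0)
    {
        unsigned char c = (unsigned char) out[len - 1];
        unsigned char *pos = bsearch(&c, sorted_chars, NCHARS, 1, cmp_bytes);
        size_t next;

        assert(pos != NULL);            /* every nonzero byte is in the table */
        next = (size_t) (pos - sorted_chars) + 1;

        /* Skip chars that collate equal to c. */
        while (next < NCHARS && collate_bytes(sorted_chars[next], c) == 0)
            next++;

        if (next < NCHARS)
        {
            out[len - 1] = (char) sorted_chars[next];
            return 1;
        }

        /* c sorts last: discard it and retry with the shorter prefix. */
        len--;
        out[len] = '\0';
    }
    return 0;
}
```

In the default "C" locale this turns 'abc' into 'abd', which matches the
fld < 'abd' rewrite quoted above.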
> But I am worried whether this trick will work in multibyte locales ---
> incrementing the last byte might generate an invalid character sequence
> and produce unpredictable results from strcmp. So we need some help
> from someone who knows a lot about collation orders and multibyte
> character representations.
For double-byte locales something similar should work, but building
the initial array is probably tricky.
----------------
Hannu