Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

From: Hannu Krosing <hannu(at)trust(dot)ee>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <maillist(at)candle(dot)pha(dot)pa(dot)us>, Daniel Kalchev <daniel(at)digsys(dot)bg>, Hiroshi Inoue <Inoue(at)tpf(dot)co(dot)jp>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem
Date: 1999-06-09 18:32:03
Message-ID: 375EB323.799FCD63@trust.ee
Lists: pgsql-hackers

Tom Lane wrote:
>
> Bruce Momjian <maillist(at)candle(dot)pha(dot)pa(dot)us> writes:
> > This certainly explains it. With locale enabled, LIKE does not use
> > indexes because we can't figure out how to do the indexing trick with
> > non-ASCII character sets because we can't figure out the maximum
> > character value for a particular encoding.
>
> We don't actually need the *maximum* character value, what we need is
> to be able to generate a *slightly larger* character value.
>
> For example, what the parser is doing now:
> fld LIKE 'abc%' ==> fld <= 'abc\377'
> is not even really right in ASCII locale, because it will reject a
> data value like 'abc\377x'.
>
> I think what we really want is to generate the "next value of the
> same length" and use a < comparison. In ASCII locale this means
> fld LIKE 'abc%' ==> fld < 'abd'
> which is reliable regardless of what comes after abc in the data.
> The trick is to figure out a "next" value without assuming a lot
> about the local character set and collation sequence.

In single-byte locales it should be easy:

1. Sort a char[256] array holding the values 0-255 using the current
   locale settings. Do it once, either at startup or when first needed.

2. Use binary search on that sorted array to locate the last char
   before the %:
     if (it is not the last char in the sorted array)
       then (replace that char with the one at index+1)
     else (
       if (it is not the first char in the LIKE string)
         then (discard the last char and goto 2)
         else (don't do the end restriction)
     )

Some locales where the character set is already in sorted order
(ASCII, CYRILLIC) could get special treatment.
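
A minimal C sketch of the two steps above, not a patch against the parser. The names like_upper_bound, sorted_chars and coll1 are made up for illustration. It assumes a single-byte locale in which strcoll() effectively orders strings character by character, which many real locales do not guarantee. Byte 0 is left out of the table because it terminates C strings, and bytes that collate as equal to the current one are skipped so that the result is strictly larger.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NCHARS 255                      /* byte values 1..255 */

/* Hypothetical table: all non-zero bytes in current LC_COLLATE order. */
static unsigned char sorted_chars[NCHARS];
static int sorted_ready = 0;

/* Collation comparison of two single characters. */
static int coll1(unsigned char a, unsigned char b)
{
    char sa[2] = { (char) a, '\0' };
    char sb[2] = { (char) b, '\0' };

    return strcoll(sa, sb);
}

/* qsort/bsearch comparator: collation order, ties broken by byte value
 * so that the order is total and every byte can be found again. */
static int cmp_collate(const void *pa, const void *pb)
{
    unsigned char a = *(const unsigned char *) pa;
    unsigned char b = *(const unsigned char *) pb;
    int r = coll1(a, b);

    return (r != 0) ? r : (int) a - (int) b;
}

/* Step 1: build the sorted table once, when first needed. */
static void init_sorted_chars(void)
{
    int i;

    if (sorted_ready)
        return;
    for (i = 0; i < NCHARS; i++)
        sorted_chars[i] = (unsigned char) (i + 1);
    qsort(sorted_chars, NCHARS, 1, cmp_collate);
    sorted_ready = 1;
}

/* Step 2: given the fixed part of a LIKE pattern (the text before %),
 * write a string into out such that  fld < out  can serve as the end
 * restriction. out must have room for strlen(prefix) + 1 bytes.
 * Returns 1 on success, 0 if no end restriction can be generated. */
int like_upper_bound(const char *prefix, char *out)
{
    size_t len;

    assert(prefix != NULL && out != NULL);
    init_sorted_chars();

    len = strlen(prefix);
    memcpy(out, prefix, len + 1);

    while (len > 0)
    {
        unsigned char key = (unsigned char) out[len - 1];
        unsigned char *pos = bsearch(&key, sorted_chars, NCHARS, 1,
                                     cmp_collate);
        size_t i = (size_t) (pos - sorted_chars) + 1;

        /* skip characters that collate as equal to the current one */
        while (i < NCHARS && coll1(sorted_chars[i], key) == 0)
            i++;

        if (i < NCHARS)
        {
            out[len - 1] = (char) sorted_chars[i];
            return 1;
        }
        /* already the last char in the table: discard it and retry */
        out[--len] = '\0';
    }
    return 0;                           /* don't do the end restriction */
}
```

In the default "C" locale this turns 'abc' into 'abd', the same bound as in the ASCII example quoted above. A caller that wants another locale is assumed to have called setlocale() before the first call, since the table is built only once.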

> But I am worried whether this trick will work in multibyte locales ---
> incrementing the last byte might generate an invalid character sequence
> and produce unpredictable results from strcmp. So we need some help
> from someone who knows a lot about collation orders and multibyte
> character representations.

for double-byte locales something similar should work, but getting
the initial array is probably tricky

----------------
Hannu
