Re: Slightly insane use of USE_WIDE_UPPER_LOWER in pg_trgm

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Slightly insane use of USE_WIDE_UPPER_LOWER in pg_trgm
Date: 2013-04-07 17:41:49
Message-ID: CAPpHfdvJKMz=VbBAx3GrRaU=ZNEGbtGh0LX2OWSfeEMF=6u54A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Apr 7, 2013 at 7:43 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> While reviewing the latest incarnation of the regex indexing patch,
> I noticed that make_trigrams() in contrib/pg_trgm/trgm_op.c is coded
> so that if USE_WIDE_UPPER_LOWER is not set, it ignores multibyte
> character boundaries and just makes trigrams from 3-byte substrings.
> This seems slightly insane, not least because there's an Assert there
> that will fail if it's fed any multibyte characters. I suppose no one
> has actually tried this code with non-ASCII data on machines where
> USE_WIDE_UPPER_LOWER isn't set; at least not with Asserts turned on.
> (Considering that even my favorite dinosaur HPUX machine has got both
> HAVE_WCSTOMBS and HAVE_TOWLOWER, it may well be that there *aren't* any
> such machines anymore.)
>
> So I'm inclined to remove the two #ifdef USE_WIDE_UPPER_LOWER tests
> in trgm_op.c, and just use the multibyte-aware code all the time.
> A downside of this is that if there is indeed anyone out there storing
> non-ASCII trigrams on a machine without USE_WIDE_UPPER_LOWER, their
> indexes would break if they pg_upgrade to 9.3. OTOH their indexes would
> break anyway if they rebuilt against a more modern libc, or built with
> Asserts on.
>
> If we don't do this then we'll have to complicate the regex indexing
> patch some more, since it's currently imagining that cnt_trigram()
> is always the way to make storable trigrams from raw text, and this
> is just wrong in the existing non-USE_WIDE_UPPER_LOWER code path
>

+1 for removing #ifdef USE_WIDE_UPPER_LOWER tests. Even if it works
somewhere with non-ASCII data without USE_WIDE_UPPER_LOWER then anyway it's
a buggy logic with invalid results.

It's also likely we can change
if (pg_database_encoding_max_length() > 1)
into something like
if (pg_database_encoding_max_length() > 1 && bytelen != charlen)

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2013-04-07 18:00:17 Re: Slightly insane use of USE_WIDE_UPPER_LOWER in pg_trgm
Previous Message Tom Lane 2013-04-07 15:43:26 Slightly insane use of USE_WIDE_UPPER_LOWER in pg_trgm