Re: DB encoding, locale and indexes

From: Sterfield <sterfield(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: DB encoding, locale and indexes
Date: 2015-02-05 15:18:39
Message-ID: CAPf6=kei-VXJ5AvtqxpLFyEpU--iM=gC6r4rukMvrpCxcakKvA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

2015-02-05 15:56 GMT+01:00 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:

> Sterfield <sterfield(at)gmail(dot)com> writes:
> > I'm a sysadmin working for an application that stores all its data in a
> PG
> > database.
> > Currently, the cluster has its encoding set to UTF-8, and the locale
> (both
> > LC_COLLATE and LC_CTYPE) is set to 'en_US.UTF-8'.
>
> > I discovered recently that the indexes created on varchar fields are not
> > working for LIKE operator, as they are created without the correct class
> > operator (as stated in
> > http://www.postgresql.org/docs/9.2/static/indexes-types.html)
>
> Right, because en_US.UTF-8 uses dictionary sort order rather than plain
> byte-by-byte sort.
>
> > The most straight-forward solution would be to create a second index on
> the
> > same field but with the class operator, in order to have indexes both for
> > =, >, < operators AND LIKE / regexp operators. Few additional indexes,
> some
> > diskspace eaten, problem solved.
>
> Yup.
>
> > However, some people are saying that nothing has to change on the index,
> > and that the only thing we have to do is to change the LC_COLLATE of each
> > databases to 'C', in order for the indexes to work without the class
> > operator.
>
> Yes, that is another possible solution, and it's documented. Keep in mind
> though that you can *not* just reach into pg_database and tweak those
> fields; if you did, all your indexes would be corrupt, because they'd no
> longer match the sort order the system is expecting. The only safe way to
> get there would be to dump and reload into a new database set up this way.
> (If you wanted to live dangerously, I guess you could manually tweak the
> pg_database fields and then REINDEX every affected index ... but this
> still involves substantial downtime, and I would not recommend doing it
> without practicing on a test installation.)
>

Yeah, I'll not take the risk. For current databases, I'll probably create
manually new indexes. For new environment, I'll change the LC_COLLATE to
'C'. I've spent some time re-creating a test environment, using encoding to
UTF8, locale to 'en_US.UTF-8' except LC_COLLATE set to 'C'. Nothing special
to report, the index is working as expected for LIKE operators, and I have
correct answers if I'm doing a LIKE '<something>é%'.

> You also have to ask whether any of your applications are expecting ORDER
> BY some-text-field to produce dictionary order rather than ASCII order.
>

Indeed, the order of the results is not the same with a LC_COLLATE to
'en_US.UTF-8' or LC_COLLATE to 'C', but I highly doubt that the application
is taking advantage of having an index already sorted.

> > - If I have unicode character stored in my database (for example 'é'),
> > and the LC_COLLATE set to 'C', how the index will behave if I do a
> query
> > with LIKE '<something>é%' ?
>
> It's still the same character, but it will sort in a possibly unexpected
> way.
>
> regards, tom lane
>

Many thanks for your help, guys, especially on this non-trivial subject (at
least, for me).

Cheers,

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Rémi Cura 2015-02-05 16:07:27 Re: How do I bump a row to the front of sort efficiently
Previous Message Tom Lane 2015-02-05 14:56:11 Re: DB encoding, locale and indexes