Re: Why do indexes and sorts use the database collation?

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why do indexes and sorts use the database collation?
Date: 2023-11-11 01:19:43
Message-ID: 20231111011943.ktfppxrcqtjo66bg@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2023-11-10 16:03:16 -0800, Jeff Davis wrote:
> An "en_US" user doing:
>
> CREATE TABLE foo(t TEXT PRIMARY KEY);
>
> is providing no indication that they want an index tailored to their
> locale. Yet we are creating the index with the "en_US" collation and
> therefore imposing huge performance costs (something like 2X slower
> index build time than the "C" locale), and also huge dependency
> versioning risks that could lead to index corruption and/or wrong
> results.

I guess you are arguing that the user didn't intend to create an index here? I
don't think that is true - users know that pkeys create indexes. If we used C
here, users would often need to create a second index on the same column using
the actual database collation - I think we'd very commonly end up with
complaints that the pkey index doesn't work, because it's been automatically
created with a different collation than the column.

Also, wouldn't the intent to use a different collation for the column be
expressed by changing the column's collation?

> Similarly, a user doing:
>
> SELECT DISTINCT t FROM bar;
>
> is providing no indication that they care about the collation of "t"
> (we are free to choose a HashAgg which makes no ordering guarantee at
> all). Yet if we choose Sort+GroupAgg, the Sort will be performed in the
> "en_US" locale, which is something like 2X slower than the "C" locale.

OTOH, if we are choosing a groupagg, we might be able to implement that using
an index, which is more likey to exist in the databases collation. Looks like
we even just look for indexes that are in the database collation.

Might be worth teaching the planner additional smarts here.

> Thoughts?

I seriously doubt its a good idea to change which collations primary keys use
by default. But I think there's a decent bit of work we could do in the
planner, e.g:

- Teach the planner to take collation costs into account for costing - right
now index scans with "C" cost the same as index scans with more expensive
collations. That seems wrong even for equality lookups and would make it
hard to make improvements to prefer cheaper collations in other situations.

- Teach the planner to use cheaper collations when ordering for reasons other
than the user's direct request (e.g. DISTINCT/GROUP BY, merge joins).

I think we should also explain in our docs that C can be considerably faster -
I couldn't find anything in a quick look.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2023-11-11 02:32:27 Re: Adding facility for injection points (or probe points?) for more advanced tests
Previous Message Michael Paquier 2023-11-11 00:38:53 Re: maybe a type_sanity. sql bug