Quick Links

Re: Why do indexes and sorts use the database collation?

From:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>, Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Why do indexes and sorts use the database collation?
Date:	2023-11-13 21:36:24
Message-ID:	89e23afc-97b7-bf4c-44fb-3fa0a90dad58@enterprisedb.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 11/13/23 19:02, Andres Freund wrote:
> Hi,
>
> On 2023-11-11 23:19:55 -0800, Jeff Davis wrote:
>> On Fri, 2023-11-10 at 17:19 -0800, Andres Freund wrote:
>>> I guess you are arguing that the user didn't intend to create an
>>> index here?
>>
>> No, obviously the user should expect an index when a primary key is
>> created. But that doesn't mean that it necessarily needs to be ordered
>> according to the database collation.
>>
>> Unfortunately, right now the planner doesn't understand that an index
>> in the "C" locale can satisfy equality searches and constraint
>> enforcement for "en_US" (or any other deterministic collation). That's
>> probably the first thing to fix.
>>
>> Inequalities and ORDER BYs can't benefit from an index with a different
>> collation, but lots of indexes don't need that.
>
> But we don't know whether the index is used for that. If we just change the
> behaviour, there will be lots of pain around upgrades, because queries will
> continue to work but be dog slow.
>

Yeah. I don't quite agree with the initial argument that not specifying
the collation explicitly in CREATE TABLE or a query means the user does
not care about the collation. We do have the sensible behavior that if
you don't specify a collation, you get the database one as a default.

I don't think we can just arbitrarily override the default because we
happen to think "C" is going to be faster. If we could prove that using
"C" is going to produce exactly the same results as for the implicit
collation (for a given operation), then we can simply do that. Not sure
if such proof is possible, though.

For example, I don't see how we could arbitrarily override the collation
for indexes backing primary keys, because how would you know the user
will never do a sort on it? Not that uncommon with natural primary keys,
I think (not a great practice, but people do that).

Perhaps we could allow the PK index to have a different collation, say
by supporting something like this:

ALTER TABLE distributors ADD PRIMARY KEY (dist_id COLLATE "C");

And then the planner would just pick the right index, I think.

>
>>> Also, wouldn't the intent to use a different collation for the column
>>> be
>>> expressed by changing the column's collation?
>>
>> The column collation expresses the semantics of that column. If the
>> user has a database collation of "en_US", they should expect ORDER BY
>> on that column to be according to that locale unless otherwise
>> specified.
>
> That makes no sense to me. Either the user cares about ordering, in which case
> the index needs to be in that ordering for efficient ORDER BY, or they don't,
> in which neither index nor column needs a non-C collation. You partially
> premised your argument on the content of primary keys typically making non-C
> collations undesirable!
>

I may be missing something, but what's the disagreement here? If the
user cares about ordering, they'll specify ORDER BY with either an
explicit or the default collation. If the index collation matches, it
may be useful for the ordering.

Of course, if we feel entitled to create the primary key index with a
collation of our choosing, that'd make this unpredictable.

>
>>> OTOH, if we are choosing a groupagg, we might be able to implement
>>> that using
>>> an index, which is more likey to exist in the databases collation.
>>> Looks like
>>> we even just look for indexes that are in the database collation.
>>>
>>> Might be worth teaching the planner additional smarts here.
>>
>> Yeah, we don't need to force anything, we could just create a few paths
>> with appropriate path key information and cost them.
>
> I'm not sure it's quite that easy. One issue is obviously that this could lead
> to a huge increase in paths we need to keep around due to differing path
> keys. We might need to be a bit more aggressive about pruning such paths than
> I think we would be today.
>

Right. There's also the challenge that we determine "interesting
pathkeys" very early, and I'm not sure if we can decide which pathkeys
(for different collations) are cheaper at that point.

>
>>> - Teach the planner to use cheaper collations when ordering for
>>> reasons other
>>> than the user's direct request (e.g. DISTINCT/GROUP BY, merge
>>> joins).
>>
>> +1. Where "cheaper" comes from is an interesting question -- is it a
>> property of the provider or the specific collation? Or do we just call
>> "C" special?
>
> I'd think the specific collation. Even if we initially perhaps just get the
> default cost from the provider such, it structurally seems the sanest place to
> locate the cost.
>

ISTM it's about how complex the rules implemented by the collation are,
so I agree the cost should be a feature of collations not providers.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: Why do indexes and sorts use the database collation? at 2023-11-13 18:02:47 from Andres Freund

Responses

Re: Why do indexes and sorts use the database collation? at 2023-11-13 22:12:12 from Andres Freund
Re: Why do indexes and sorts use the database collation? at 2023-11-14 01:58:54 from Jeff Davis
Re: Why do indexes and sorts use the database collation? at 2023-11-14 04:48:38 from Laurenz Albe

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Roberto Mello	2023-11-13 21:40:08	[DOC] Add detail regarding resource consumption wrt max_connections
Previous Message	Bruce Momjian	2023-11-13 21:27:23	Re: "box" type description