Re: Hash join not finding which collation to use for string hashing

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
Subject: Re: Hash join not finding which collation to use for string hashing
Date: 2020-01-30 20:18:05
Message-ID: 6D80287E-E70B-43D2-B144-1BA7DDA21040@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Jan 30, 2020, at 12:02 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
>> According to Tom:
>>> (BTW, before v12 the text '=' operator indeed did not care for collation,
>>> so this example would've worked. But the change in behavior is a
>>> necessary consequence of having invented nondeterministic collations,
>>> not a bug.)
>
>> I’m still struggling with that, because the four collations I used in
>> the example are all deterministic. I totally understand why having more
>> than one collation matters if you ask that your data be in sorted order,
>> as the system needs to know which ordering to use. But for equality, I
>> would think that deterministic collations are all interchangeable,
>> because they all agree on whether A = B, regardless of the collation
>> defined on column A and/or on column B. Maybe I’m wrong about that.
>
> Well, you're not wrong, but you're assuming much finer distinctions
> than the collation machinery actually makes (or than it'd be sane
> to ask it to make, IMO). We don't have a way to tell texteq that
> "well, we don't know what collation to assign to this operation,
> but it's okay to assume that it's deterministic". Nor does the
> parser have any way to know that texteq could be satisfied by
> that knowledge --- if it doesn't even know whether texteq cares
> about collation, how could it know that?

I agree. Having this in the parser seems really weird and unwholesome.

> There are other issues here too. Just because the query could
> theoretically be implemented without reference to any specific
> collation doesn't mean that that's a practical thing to do.
> It'd be unclear for instance whether we can safely use indexes
> that *do* have specific collations attached. We'd also lose
> the option to consider plans like mergejoins.
>
> If the parser understood that a particular operator behaved
> like text equality --- which it does not, and I guarantee you
> I will shoot down any proposal to hard-wire a parser test for
> that particular operator --- you could imagine assigning "C"
> collation when we have an unresolvable combination of
> deterministic collations for the inputs. That dodges the
> problem of not having any way to represent the situation.
> But it's still got implementation issues, in that such a
> collation choice probably won't match the collations of any
> indexes for the input columns.

Yeah, I disclaimed that idea in a subsequent email, but if you’re responding to my emails in the order that you receive them (which is totally reasonable), then you aren’t to know that yet.

>
> Another issue is that collations "bubble up" in the parse tree,
> so sneaking in a collation that's not supposed to be there per
> spec carries a risk of causing unexpected semantics further up.
> I think we could get away with that for the particular case of
> equality (which returns collation-less boolean), but this is
> another thing that makes the case narrower and less useful.

I was wondering if bubbling up (LeftCollation,RightCollation) would be ok. There are likely cases that can’t make use of that, but those places would just throw the same sort of error that they’re currently throwing, except they’d have a more useful error message because it would include which collations were mismatched.

> In the end, TBH, I'm not finding your example compelling enough
> to be worth putting in weird hacks for such cases. If you're
> joining columns of dissimilar collations, you're going to be
> finding it necessary to specify what collation to use in a lot
> of places ... so where's the value in making a weird special
> case for equality?

I agree with your position against weird hacks. If the only way to do this is a weird hack, then forget about it.

If I’m not putting upon your time too much, could you respond to my other email in this thread as to whether it sounds any better?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-01-30 20:29:37 Re: Hash join not finding which collation to use for string hashing
Previous Message Robert Haas 2020-01-30 20:17:32 Re: Hash join not finding which collation to use for string hashing