From: | Junwang Zhao <zhjwpku(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | jian he <jian(dot)universality(at)gmail(dot)com>, Aleksander Alekseev <aleksander(at)timescale(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, "andreas(at)proxel(dot)se" <andreas(at)proxel(dot)se>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: general purpose array_sort |
Date: | 2024-11-09 14:45:10 |
Message-ID: | CAEG8a3KD7ZmQpxNhfPxyc0BjTTTUXiqb56VuMgB7Muu0+yV=qQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi Michael,
On Fri, Nov 8, 2024 at 8:52 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Thu, Nov 07, 2024 at 10:06:04PM +0800, Junwang Zhao wrote:
> > I'm afraid this can not be achieved in my current implementation, a simple
> > case is:
> >
> > SELECT array_sort('{foo,bar,null,CCC,Abc,bbc}'::text[]);
> > {Abc,bar,bbc,CCC,foo,NULL}
> > SELECT array_sort('{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "C");
> > {Abc,CCC,bar,bbc,foo,NULL}
> >
> > SELECT array_sort(a COLLATE "C") FROM (VALUES
> > ('{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "C"),
> > ('{foo,bar,null,CCC,Abc,bbc}'::text[])) v(a);
> > {Abc,CCC,bar,bbc,foo,NULL}
> > {Abc,CCC,bar,bbc,foo,NULL}
> >
> > Maybe add some documents to specify this?
>
> So, if I use that:
> CREATE COLLATION case_sensitive (provider = icu, locale = '');
> =# SELECT array_sort('{Abc,CCC,bar,bbc,foo,NULL}'::text[]
> COLLATE "case_sensitive");
> array_sort
> ----------------------------
> {Abc,bar,bbc,CCC,foo,NULL}
> (1 row)
> =# SELECT array_sort('{Abc,CCC,bar,bbc,foo,NULL}'::text[]
> COLLATE "C");
> array_sort
> ----------------------------
> {Abc,CCC,bar,bbc,foo,NULL}
> (1 row)
>
> What takes priority is the collation defined with the array_sort,
> which is fine:
> =# SELECT array_sort(a collate "case_sensitive") FROM
> (VALUES ('{foo,bar,null,CCC,Abc,bbc}'::text[]),
> ('{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "C" )) v(a);
> array_sort
> ----------------------------
> {Abc,bar,bbc,CCC,foo,NULL}
> {Abc,bar,bbc,CCC,foo,NULL}
> (2 rows)
> =# SELECT array_sort(a collate "C") FROM
> (VALUES ('{foo,bar,null,CCC,Abc,bbc}'::text[]),
> ('{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "case_sensitive" )) v(a);
> array_sort
> ----------------------------
> {Abc,CCC,bar,bbc,foo,NULL}
> {Abc,CCC,bar,bbc,foo,NULL}
> (2 rows)
>
> The case where the collation is defined in the set of values is a bit
> more troubling to me, as it depends on what the values want to be
> applied, still that's OK because the collation applied is the one
> coming from the set of values:
> =# SELECT array_sort(a) FROM
> (VALUES ('{foo,bar,null,CCC,Abc,bbc}'::text[]),
> ('{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "case_sensitive" )) v(a);
> array_sort
> ----------------------------
> {Abc,bar,bbc,CCC,foo,NULL}
> {Abc,bar,bbc,CCC,foo,NULL}
> (2 rows)
> =# SELECT array_sort(a) FROM
> (VALUES ('{foo,bar,null,CCC,Abc,bbc}'::text[]),
> ('{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "C" )) v(a);
> array_sort
> ----------------------------
> {Abc,CCC,bar,bbc,foo,NULL}
> {Abc,CCC,bar,bbc,foo,NULL}
> (2 rows)
>
> I am wondering if there are more fancy cases where the saved cache
> could force a state that would lead to puzzling results, say with
> different collations that should be applied. I'd recommend to
> research that more, to reflect that in the docs and to add tests that
> show what we should expect in these cases within 0001 because this
> new function is mimicking in the context of a function execution
> multiple query clauses where restrictions are applied when analyzing
> the query, close to the parser.
>
> For example, UNION and UNION ALL require a common collation when
> processing a set of expressions related to them, which would be OK.
> Perhaps I lack some imagination to be able to break things.
> --
> Michael
While trying to come up with more test cases, it comes to me if the
PG_GET_COLLATION() has already done the work to give array_sort
the right collation oid? I did not pass the typentry->typcollation but
PG_GET_COLLATION() to tuplesort_begin_datum.
I tried:
CREATE COLLATION case_sensitive (provider = icu, locale = '');
create table t1(a int, b text[] COLLATE "C");
create table t2(a int, b text[] COLLATE "case_sensitive");
insert into t1 values (1, '{foo,bar,null,CCC,Abc,bbc}'::text[]);
insert into t2 values (2, '{foo,bar,null,CCC,Abc,bbc}'::text[]);
select array_sort(b) from t1;
select array_sort(b) from t2;
Set breakpoint to see the collation oid, they all give the expected result.
For the following cases:
SELECT array_sort(a) FROM
(VALUES ('{foo,bar,null,CCC,Abc,bbc}'::text[]),
('{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE
"case_sensitive" )) v(a);
WITH t AS (select '{foo,bar,null,CCC,Abc,bbc}'::text[] a UNION ALL
select '{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "case_sensitive"
a) SELECT array_sort(a) from t;
The collation seems to have been decided in select_common_collation of
the transform phase.
For:
WITH t AS (select '{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "C" a
UNION ALL select
'{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "case_sensitive"a)
SELECT array_sort(a) from t;
ERROR: collation mismatch between explicit collations "C" and "case_sensitive"
LINE 2: '{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "case_sens...
merge_collation_state gives out an ERROR since its explicit collation.
But for implicit collation,
select_common_collation sets InvalidOid to context.collation, so the
following works:
select b from t1 UNION ALL select b from t2;
But since the context has the InvalidOid as collation,
PG_GET_COLLATION() in arrary_sort
got InvalidOid, the following errors:
WITH t3 AS (select b from t1 UNION ALL select b from t2) select
array_sort(b) from t3;
ERROR: could not determine which collation to use for string comparison
HINT: Use the COLLATE clause to set the collation explicitly.
The error message comes from tuplesort_begin_datum's call stack, we
can do explicit COLLATE to make it work:
WITH t3 AS (select b from t1 UNION ALL select b from t2) select
array_sort(b collate "C") from t3;
Based on the above analysis, I think it's ok to pass PG_GET_COLLATION()
to tuplesort_begin_datum.
PFA v14 with Robert's comment addressed.
--
Regards
Junwang Zhao
Attachment | Content-Type | Size |
---|---|---|
v14-0001-general-purpose-array_sort.patch | application/octet-stream | 15.3 KB |
v14-0002-support-sort-order-and-nullsfirst-flag.patch | application/octet-stream | 8.7 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Nathan Bossart | 2024-11-09 15:59:42 | Re: New GUC autovacuum_max_threshold ? |
Previous Message | David G. Johnston | 2024-11-09 14:28:18 | Re: New "single" COPY format |