Re: case insensitive collation of Greek's sigma

From: Frank Limpert <frank_limpert(at)yahoo(dot)com>
To: Jakub Jedelsky <jakub(dot)jedelsky(at)gooddata(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: case insensitive collation of Greek's sigma
Date: 2021-12-01 07:50:56
Message-ID: ca27a92d-9af1-7f39-c04d-44ffcd05de64@yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Am 26.11.21 um 08:37 schrieb Jakub Jedelsky:
> Hello,
>
> during our tests of Postgres with ICU we found an issue with ILIKE of
> upper and lowercase sigma (Σ). The letter has two lowercase variants σ
> and ς (at the end of a word). I'm working with en_US and en-US-x-icu
> collations and results are a bit unexpected - they are inverted:
>
> postgres=# SELECT
> postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en_US",
> postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en_US"
> postgres-# ;
>  ?column? | ?column?
> ----------+----------
>  t        | f
> (1 row)
>
> postgres=# SELECT
> postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en-US-x-icu",
> postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en-US-x-icu";
>  ?column? | ?column?
> ----------+----------
>  f        | t
> (1 row)
>
> I run those commands on the latest (14.1) official docker image.
>
> Is it possible to unify the behaviour?And which one is correct from
> the community point of view?
>
> If I could start, I think both results are wrong as both should return
> True. If I got it right, in the background there is a lower() function
> running to compare strings, which is not enough for such cases (until
> the left side isn't taken as a standalone word).
>
> Thanks,
>
> - jj

Have you seen the subtle intricacies in this example?

=> SELECT 'ΣΣ Μ' ILIKE 'σσ Μ' COLLATE "en_US" AS c0,
          'ΣΣ Μ' ILIKE 'σς Μ' COLLATE "en_US" AS c1,
          'ΣΣ Μ' ~*    'σσ Μ' COLLATE "el-GR-x-icu" AS c2,
          'ΣΣ Μ' ~*    'σς Μ' COLLATE "el-GR-x-icu" AS c3,
          'ΣΣ Μ' ILIKE 'σσ Μ' COLLATE "el-GR-x-icu" AS c4,
          'ΣΣ Μ' ILIKE 'σς Μ' COLLATE "el-GR-x-icu" AS c5,
          'ΣΣ Μ' ~*    'σσ Μ' COLLATE "en-US-x-icu" AS c6,
          'ΣΣ Μ' ~*    'σς Μ' COLLATE "en-US-x-icu" AS c7,
          'ΣΣ Μ' ILIKE 'σσ Μ' COLLATE "en-US-x-icu" AS c8,
          'ΣΣ Μ' ILIKE 'σς Μ' COLLATE "en-US-x-icu" AS c9;
 c0 | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9
----+----+----+----+----+----+----+----+----+----
 t  | f  | t  | t  | f  | t  | t  | t  | f  | t
(1 row)

Obviously, the ILIKE operator is really strict regarding to the correct
letter at the end of the word. The regular expression operator works as
you expected.

Happy computing...
Frank

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Dilip Kumar 2021-12-01 08:50:38 Re: Max connections reached without max connections reached
Previous Message Patrick FICHE 2021-12-01 07:50:44 RE: Wildcarding json keys in json query