Re: BUG #1931: ILIKE and LIKE fails on Turkish locale

From: Victor Snezhko <snezhko(at)indorsoft(dot)ru>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Volkan YAZICI <yazicivo(at)ttnet(dot)net(dot)tr>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #1931: ILIKE and LIKE fails on Turkish locale
Date: 2006-09-22 17:48:11
Message-ID: u3bajokro.fsf@indorsoft.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-tr-genel

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

>> Hmm, there was something wrong in my utf-8 cluster setup, I
>> re-attached it and now see that isspace indeed fails:
>> LOG: plpgsql_convert_ident: ident=х, isspace array=ft, isspace(0)=f
>
> Hm. Given that we've not seen this reported elsewhere, I wonder if we
> shouldn't conclude that it's a FreeBSD bug.

OK, for 0x85 maybe, but see below.

> We could fix the problem by not using isspace() but some homegrown
> equivalent ... but that seems a tad ugly,

I wonder why simple create table (outside of plpgsql, with the table
name equal to the character D1 85) works? Are multibyte strings
handled differently in query parser compared to plpgsql?

> especially if it's to work around a problem on just one locale on
> one platform. Can anyone else check the result of isspace(0x85) ?

Not sure about one platform, we need to check this. And here are some
news: MacOSX's isspace also returns 1 on 0x85. So, 1.5 platforms.

The test I have run is:

#include <locale.h>
#include <ctype.h>
#include <stdio.h>

int main()
{
setlocale(LC_ALL, "");
printf("%d %d\n", isspace(0x85), isspace(0xA0));
printf("%d %d\n", iswspace(0x85), iswspace(0xA0));
return 0;
}

0xA0 is added because is's true space, see below.

On FreeBSD & Mac this yields the following output:
1 1
1 1

On recent ALTLinux (Compact 3.0), which uses ru_RU.UTF-8 locale by
default:
0 0
0 0

Here is an excerpt from the unicode character database
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt)

0085;<control>;Cc;0;B;;;;;N;NEXT LINE (NEL);;;;
00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;

0x85 is the alternative next-line, and 0xA0 is the real space (note
the mark Zs - http://www.unicode.org/Public/UNIDATA/UCD.html
describes symbols with such marks as "Separator, Space")

So, that linux doesn't return true for 0x85 and 0xA0 characters - but
not because it's is* functions return false for any character above
0x80, but because it's LC_CTYPE doesn't know anything about these
particular characters! (iswspace calls also return false). So linux
has working isspace, but broken iswspace for these characters...

So, could you run one more test on that Fedora to see if at least
isspace() and isalpha() meet out expectations?

#include <locale.h>
#include <ctype.h>
#include <stdio.h>

int main()
{
int i;
setlocale(LC_ALL, "");
for (i=0x80; i <= 0xFF; ++i) {
if (isspace(i) != 0) printf("character 0x%x is a space\n", i);
if (isalpha(i) != 0) printf("character 0x%x is alphabetical\n", i);
if (isdigit(i) != 0) printf("character 0x%x is a digit\n", i);
}
return 0;
}

My FreeBSD lists a whole heck of characters:

character 0x85 is a space
character 0xa0 is a space
character 0xaa is alphabetical
character 0xb5 is alphabetical
character 0xba is alphabetical
character 0xc0 is alphabetical
... 0xc1-0xfe is alphabetical
character 0xff is alphabetical

--
WBR, Victor V. Snezhko
E-mail: snezhko(at)indorsoft(dot)ru

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Douglas Toltzman 2006-09-22 17:55:14 Re: BUG #1931: ILIKE and LIKE fails on Turkish locale
Previous Message Tom Lane 2006-09-22 16:48:50 Re: BUG #1931: ILIKE and LIKE fails on Turkish locale

Browse pgsql-tr-genel by date

  From Date Subject
Next Message Douglas Toltzman 2006-09-22 17:55:14 Re: BUG #1931: ILIKE and LIKE fails on Turkish locale
Previous Message Tom Lane 2006-09-22 16:48:50 Re: BUG #1931: ILIKE and LIKE fails on Turkish locale