Re: Latest on CITEXT 2.0

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Latest on CITEXT 2.0
Date:	2008-06-26 17:09:37
Message-ID:	7998C08A-D40B-4081-A343-1EA1B3FA7976@kineticode.com
Lists:	pgsql-hackers

On Jun 26, 2008, at 10:02, Tom Lane wrote:

> BTW, I don't think you can use that same-length optimization for
> citext. There's no reason to think that upper/lowercase pairs will
> have the same length all the time in multibyte encodings.

I was wondering about that. I had been thinking of canonically-
equivalent strings and combining marks. Doing a quick test, it looks
like a precomposed character and its combining-mark form do not
compare as equal. For example, this returns false:

SELECT 'Ä'::text = 'Ä'::text;

At least with en_US.UTF-8. Hrm. It looks like my client makes them
both canonical, so I've attached a script demonstrating this issue.
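
Since a mail client may normalize the two literals above into the same bytes, here is a minimal sketch of the same comparison with explicit escapes. It uses Python rather than PostgreSQL, so it only illustrates the Unicode side of the issue (code points and UTF-8 byte counts), not how PostgreSQL compares text under a given locale. The variable names are just for illustration.

```python
import unicodedata

# The same visible character, written two canonically-equivalent ways.
precomposed = "\u00C4"   # LATIN CAPITAL LETTER A WITH DIAERESIS (one code point)
decomposed = "A\u0308"   # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# A plain code point comparison treats them as different strings.
same_without_normalizing = precomposed == decomposed

# Their UTF-8 encodings also differ in length.
precomposed_bytes = len(precomposed.encode("utf-8"))
decomposed_bytes = len(decomposed.encode("utf-8"))

# Normalizing both to NFC makes them compare as equal.
same_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(same_without_normalizing, precomposed_bytes, decomposed_bytes, same_after_nfc)
```

The two forms differ in both content and byte length until something normalizes them, which is presumably what the client did to the statement above.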

Anyway, I was aware of different byte counts for canonical
equivalence, but not for differences between upper- and lowercase
characters. I'd certainly defer to your knowledge of how these things
truly work in PostgreSQL, Tom, and can of course easily remove that
optimization. So, are you certain about this?
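
For what it's worth, a quick sketch outside the database suggests such pairs do exist in UTF-8. This uses Python's built-in Unicode case mappings, which are an assumption here: PostgreSQL's lower() and upper() depend on the server's locale and C library, so the exact characters affected there may differ. The helper name utf8_len is made up for this example.

```python
def utf8_len(s: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))

# Characters whose case-mapped counterpart has a different UTF-8 length
# under Python's Unicode case mappings.
dotless_i = "\u0131"     # LATIN SMALL LETTER DOTLESS I
kelvin = "\u212A"        # KELVIN SIGN
a_stroke = "\u023A"      # LATIN CAPITAL LETTER A WITH STROKE

pairs = [
    (dotless_i, dotless_i.upper()),  # maps to ASCII "I"
    (kelvin, kelvin.lower()),        # maps to ASCII "k"
    (a_stroke, a_stroke.lower()),    # maps to U+2C65
]

for original, mapped in pairs:
    print(ascii(original), utf8_len(original), ascii(mapped), utf8_len(mapped))
```

If the server's case mapping behaves similarly for any of these, two strings that are equal ignoring case can have different byte lengths, so a length check cannot be used to reject a match early.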

Many thanks,

David

Attachment	Content-Type	Size
try.sql	application/octet-stream	34 bytes

In response to

Re: Latest on CITEXT 2.0 at 2008-06-26 17:02:19 from Tom Lane

Responses

Re: Latest on CITEXT 2.0 at 2008-06-26 20:59:23 from Tom Lane
