From: | "David E(dot) Wheeler" <david(at)kineticode(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: PATCH: CITEXT 2.0 v3 |
Date: | 2008-07-14 17:48:44 |
Message-ID: | EC8BD896-825A-4098-9A6E-6024DBF28078@kineticode.com |
Lists: | pgsql-hackers |
On Jul 14, 2008, at 07:24, Tom Lane wrote:
> "David E. Wheeler" <david(at)kineticode(dot)com> writes:
>> Could I supply two comparison files, one for Mac OS X with en_US.UTF-8
>> and one for everything else, as described in the last three paragraphs
>> here?
>
> The fallacy in that proposal is the assumption that there are only two
> behaviors out there.
Well, no, that's not the assumption at all. The assumption is that the
type works properly with multibyte characters under multibyte-aware
locales. So I want to have tests to ensure that such is true by having
multibyte characters run under a very specific locale and platform. I
don't really care what platform or locale; the point is to make sure
that the type is actually multibyte-aware.
> Let me recalibrate your thoughts a bit: so far
> I have tried citext on three different machines (Mac, Fedora 8, HPUX),
> and I got three different answers from those tests. That's despite
> endeavoring to make the database locales match ... which is less than
> trivial in itself because they use three slightly different spellings
> of "en_US.UTF8".
<rant>
This is a truly pitiful state of affairs. Rhetorical question: Why is
there no standardization of locales? I'm sure there are a lot of
opinions out there (should all uppercase chars precede all
lowercase chars or be mixed in with them?), but I should
think that, in this day and age, there would be some sort of standard
defining locales and how they work -- and to allow such opinions to be
expressed by different locales, not in the same locale names on
different platforms.
</rant>
> Given that you were more or less deliberately testing corner cases,
> I think it's quite likely that the number of observable reactions from
> N platforms would be more nearly O(N) than O(1).
To me they're not corner cases. To me it is just, "given a specific
platform/locale, does CITEXT respect the locale's rules?" I don't care
to test all platforms and locales (I'm not *that* stupid :-)).
> In the real world, to the extent that we are able to control the locale
> of the regression tests, we make it "C" --- and to a large extent we
> can't control it at all, which means you have another uncontrolled
> variable besides platform. So in the current universe there is
> absolutely no value in submitting locale-specific tests for a contrib
> module.
Then how do we know that it will continue to be locale-aware over
time? Someone could replace the comparison function with one that just
lowercases ASCII characters, like CITEXT 1 does, and no tests would
fail. How do you prevent that from happening without being
hyper-vigilant (and never leaving the project, I might add)?
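To make the concern concrete, here is a minimal sketch in Python. It is not CITEXT's C code, and the function names and test strings are made up for illustration. It assumes Python's built-in Unicode case mapping as a stand-in for locale-aware lowercasing, and it shows that an ASCII-only test pair cannot tell the two folding functions apart, while pairs with non-ASCII letters can.

```python
def ascii_lower(s: str) -> str:
    """Lowercase only A-Z and leave every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def unicode_lower(s: str) -> str:
    """Lowercase with Python's Unicode case mapping (a stand-in for a
    locale-aware lower(); it does not consult the OS locale)."""
    return s.lower()


def ci_equal(a: str, b: str, fold) -> bool:
    """Case-insensitive equality using the supplied folding function."""
    return fold(a) == fold(b)


# Hypothetical test pairs: one ASCII-only, two with non-ASCII letters.
# Escapes keep the strings precomposed regardless of editor or encoding.
PAIRS = [
    ("PostgreSQL", "postgresql"),
    ("\u00c0", "\u00e0"),            # A-grave, upper vs. lower
    ("\u00c9COLE", "\u00e9cole"),    # E-acute + "COLE" vs. lowercase
]

ascii_results = [ci_equal(a, b, ascii_lower) for a, b in PAIRS]
unicode_results = [ci_equal(a, b, unicode_lower) for a, b in PAIRS]

print(ascii_results)
print(unicode_results)
```

With only the first pair in the suite, swapping one folding function for the other would go unnoticed; the multibyte pairs are what would catch the regression.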
> I see some discussion in the thread about improving the situation, but
> until we are able to decouple database locale from environment locale,
> I doubt we'll be able to do a whole lot about automating this kind of
> test. There are too many variables at the moment.
Is the decoupling of database locale from environment locale likely to
happen anytime soon? Now that I've written CITEXT, I dare say that
such might become my top-desired feature (aside from replication).
Thanks for the discussion, much appreciated, and I'm learning a ton. I
retain the right to be opinionated, however. ;-)
Best,
David