Re: String encoding during connection "handshake"

From: "Trevor Talbot" <quension(at)gmail(dot)com>
To: "sulfinu(at)gmail(dot)com" <sulfinu(at)gmail(dot)com>
Cc: "Alvaro Herrera" <alvherre(at)alvh(dot)no-ip(dot)org>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: String encoding during connection "handshake"
Date: 2007-11-28 19:14:44
Message-ID: 90bce5730711281114p4d8720aeke7c69ee152c8d44e@mail.gmail.com
Lists: pgsql-hackers

On 11/28/07, sulfinu(at)gmail(dot)com <sulfinu(at)gmail(dot)com> wrote:

> Yes, you support (and worry about) encodings simply because of a C limitation
> dating from 1974, if I recall correctly...
> In Java, for example, a "char" is a very well defined datum, namely a Unicode
> point. While in C it can be some char or another (or an error!) depending on
> what encoding was used. The only definition that stands up is that a "char"
> is a byte. Its interpretation is unsure and unsafe (see my original problem).

It's not really that simple. Java, for instance, does not actually
support Unicode characters / codepoints at the base level; it merely
deals in UTF-16 code units. (The critical difference is in surrogate
pairs.) You're still stuck dealing with a specific encoding even in
many modern languages.
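
To make the surrogate-pair point concrete, here is a small Java sketch
(the particular character is only an illustration):

    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic
            // Multilingual Plane, so Java's UTF-16 representation needs
            // two char values (a surrogate pair) to hold it.
            String clef = new String(Character.toChars(0x1D11E));

            System.out.println(clef.length());                          // 2 -- UTF-16 code units
            System.out.println(clef.codePointCount(0, clef.length()));  // 1 -- actual code point
            System.out.println(Integer.toHexString(clef.charAt(0)));    // d834 -- a high surrogate, not a character
        }
    }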

PostgreSQL's encoding support is not just about languages, though; it's
also about client convenience. It could simply choose a single encoding
and parrot data to and from the client, but it also does on-the-fly
conversion when a client requests it. It's a very useful feature, and
many mature networked applications support similar things. An easy
example is the World Wide Web itself.
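
To sketch what that conversion means in bytes (plain Java, no server
involved; LATIN1 and UTF-8 here just stand in for whatever encoding a
client session asks for):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class WireBytesDemo {
        public static void main(String[] args) {
            // "é" (U+00E9) goes over the wire as different bytes depending
            // on which encoding the client asked the server to speak.
            String s = "\u00E9";

            System.out.println(Arrays.toString(s.getBytes(StandardCharsets.ISO_8859_1))); // [-23]       (0xE9)
            System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));      // [-61, -87]  (0xC3 0xA9)
        }
    }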

> I implied that a cluster should have a single encoding that covers the whole
> Unicode set. That would certainly satisfy everybody.

Note that it might not. Unicode does not encode *every* character, and
in some cases there is no round-trip mapping between it and other
character sets. The result could be a loss of semantic data. I suspect
it actually would satisfy everyone in PostgreSQL's case, but it's not
something you can assume without checking.
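
A simpler, related illustration in Java: converting between character
sets can silently discard information. (The trickier cases above run the
other way, from a legacy set through Unicode and back, but the effect is
the same kind of loss.)

    import java.nio.charset.StandardCharsets;

    public class LossyConversionDemo {
        public static void main(String[] args) {
            String euro = "\u20AC";  // EURO SIGN, which LATIN1 cannot represent

            // getBytes(Charset) substitutes the charset's replacement byte
            // ('?') for unmappable characters, so the round trip is lossy.
            byte[] latin1 = euro.getBytes(StandardCharsets.ISO_8859_1);
            String back = new String(latin1, StandardCharsets.ISO_8859_1);

            System.out.println((int) latin1[0]);  // 63, i.e. '?'
            System.out.println(back);             // "?" -- the euro sign is gone
        }
    }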

> > This has nothing to do with C by the way. C has many features that
> > allow you to work with different encodings. It just doesn't force you
> > to use any particular one.

> Yes, my point exactly! C forces you to worry about encoding. I mean, if you're
> not an ASCII-only user ;)

For a networked application, you're stuck worrying about the encoding
regardless of language. UTF-8 is the most common Internet transport,
for instance, but that's not the native internal encoding used by Java
and most other Unicode processing platforms to date. That's fairly
simple since it's still only a single character set, but if your
application domain predates Unicode, you can't avoid dealing with the
legacy encodings at some level anyway.

As I implied earlier, I do think it would be worthwhile for PostgreSQL
to move toward handling it better, so I'm not saying this is a bad
idea. It's just that it's a much more complex topic than it might seem
at first glance.

I'm glad you got something working for you.
