Escape sequences in string literals insufficient?

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Escape sequences in string literals insufficient?
Date: 2003-06-18 18:57:31
Message-ID: Pine.LNX.4.44.0306182025170.2501-100000@peter.localdomain
Lists: pgsql-hackers

Recently there has been an increasing number of user questions indicating
that the escape sequences available in string literals lack some
functionality for "un-enterable" characters. This problem manifests itself
in one of two ways:

1. A user wants to enter a character with a known Unicode codepoint (a
16-bit value). He cannot use the \ddd notation directly; he needs to
manually convert the 16-bit value to UTF-8, which results in a sequence of
one or more bytes. This conversion is tedious and error-prone to do by
hand. (This case assumes the user knows the server encoding is UTF-8.)

2. A user wants to enter an arbitrary character in his client encoding, but
he cannot enter it on his keyboard. Say you want to enter the Euro sign.
In ISO-8859-15, the Euro sign is decimal 164 (octal 244), so you try
'\244'. But the byte value represented by this escape mechanism is
interpreted in the server encoding, and if you don't know the server
encoding (which you shouldn't be required to), you cannot use this. If the
server encoding is UTF-8, this is an illegal byte sequence.
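
To make both cases concrete, here is a small Python sketch (not anything
in PostgreSQL; the helper names octal_escapes and is_valid_utf8 are made
up for illustration). It does the manual conversion from case 1 and shows
why the lone byte from case 2 is rejected under UTF-8.

```python
def octal_escapes(codepoint: int) -> str:
    """Return the \\ddd escapes for the UTF-8 bytes of a Unicode code point."""
    utf8_bytes = chr(codepoint).encode("utf-8")
    return "".join("\\{:03o}".format(b) for b in utf8_bytes)


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the byte string is a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Case 1: the Euro sign, U+20AC, becomes three escaped bytes in UTF-8.
print(octal_escapes(0x20AC))  # prints \342\202\254

# Case 2: byte 164 (octal 244) is the Euro sign in ISO-8859-15 ...
assert bytes([0o244]).decode("iso-8859-15") == "\u20ac"
# ... but the same lone byte is not valid UTF-8.
assert not is_valid_utf8(bytes([0o244]))
```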

Obviously, the \ddd notation missed the train when the world was
introduced to multibyte encodings and encoding conversion. I guess we
cannot change it anymore, but we need a new mechanism.

One possibility is to introduce the notation from Java, '\uXXXX'
(hexadecimal digits) to designate a Unicode character. This would then be
converted to whatever the server encoding is. Obviously, this would solve
problem #1 from above. Problem #2 would be solved only indirectly: the
user would always have to look up the codepoint in Unicode instead of in
the client encoding.
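
A rough Python sketch of how such an escape could behave, purely as an
illustration. The function name expand_u_escapes is hypothetical, and the
sketch ignores details a real lexer would need, such as escaped
backslashes and characters the server encoding cannot represent.

```python
import re

# Matches a backslash, the letter u, and exactly four hexadecimal digits.
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def expand_u_escapes(literal: str, server_encoding: str) -> bytes:
    """Replace each \\uXXXX with its character, then encode for the server."""
    text = _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), literal)
    return text.encode(server_encoding)


# The same literal yields different bytes depending on the server encoding.
print(expand_u_escapes(r"price \u20AC 200", "utf-8"))
print(expand_u_escapes(r"price \u20AC 200", "iso-8859-15"))
```

The user only ever writes the Unicode codepoint; the byte sequence is
derived from the server encoding and never has to be worked out by hand.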

Another possibility is to introduce a new notation that designates a
specific code point in the client encoding. Say we call it '\yXXXX'. Then
if your client encoding is ISO-8859-15 you can enter a Euro sign using
'\yA4'; if your client encoding is UTF-8 you can enter it using '\y20AC'.
I'm not sure, however, whether all encodings know the concept of a
codepoint.
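
One possible reading of '\yXXXX', sketched in Python. This is an
assumption and not a worked-out design: for UTF-8 the hex digits are taken
as a Unicode codepoint (matching the '\y20AC' example), and for any other
client encoding they are taken as the raw bytes of the character. The
function name decode_y_escape is hypothetical.

```python
def decode_y_escape(hex_digits: str, client_encoding: str) -> str:
    """Resolve the hex digits of a \\yXXXX escape to a single character."""
    normalized = client_encoding.lower().replace("_", "-")
    if normalized in ("utf-8", "utf8"):
        # Assumption: under UTF-8 the digits name a Unicode codepoint.
        return chr(int(hex_digits, 16))
    # Assumption: otherwise the digits are the character's raw bytes.
    return bytes.fromhex(hex_digits).decode(client_encoding)


# Both spellings from the text resolve to the Euro sign.
assert decode_y_escape("A4", "ISO-8859-15") == "\u20ac"
assert decode_y_escape("20AC", "UTF-8") == "\u20ac"
```

The need for two branches reflects the open question above: "code point"
means different things depending on the encoding.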

If you're concerned about adding more nonstandard escape sequences or how
to implement them given the variable-length data after the magic letter,
you can also think of these as a new function, so you could write: 'The
price is ' || unicode(0x20AC) || ' 200.' This is uglier but more
flexible.

Comments/better ideas?

--
Peter Eisentraut peter_e(at)gmx(dot)net
