From: | Marti Raudsepp <marti(at)juffo(dot)org> |
---|---|
To: | Torsten Zuehlsdorff <foo(at)meisterderspiele(dot)de> |
Cc: | Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org |
Subject: | Re: How to convert HEX to ASCII? |
Date: | 2011-12-02 16:42:01 |
Message-ID: | CABRT9RDai1xCvvJB28ycA=jF43vO1rANLjbq-qNBc0YrKSHBAw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff
<foo(at)meisterderspiele(dot)de> wrote:
> But i clearly have a missunderstanding of other chars, like umlauts or utf-8
> chars. This, for example, should return a 'ö':
>
> # SELECT chr(x'C3B6'::int);
> chr
> -----
> 쎶
> (1 row)
That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded
and actually decodes to the codepoint 00F6.
There is a fundamental problem that a decoded URL may actually be a
binary string -- it might not have a textual representation at all.
But if text is what you want, RFC3986 strongly suggests using UTF-8
for encoding text strings in URLs, and that works almost always in the
real world.
So the *right* way is to first convert the URL to a binary "bytea"
type by fixing all the % escapes, then convert that to UTF-8 encoding
to handle multibyte characters.
What I came up with is far from elegant because PostgreSQL lacks
convenient functions for bytea manipulation (no bytea_agg, etc).
Stealing a little from Merlin, this is what it looks like:
CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
bin bytea = '';
byte text;
BEGIN
FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP
IF length(byte) = 3 THEN
bin = bin || decode(substring(byte, 2, 2), 'hex');
ELSE
bin = bin || byte::bytea;
END IF;
END LOOP;
RETURN convert_from(bin, 'utf8');
END
$$;
db=# select url_decode('Hell%C3%B6%20World%21');
url_decode
--------------
Hellö World!
db=# select url_decode('%EC%8E%B6');
url_decode
------------
쎶
This will break for binary-encoded data in URLs, though.
db=# select url_decode('%fa%fa%fa');
ERROR: invalid byte sequence for encoding "UTF8": 0xfa
CONTEXT: PL/pgSQL function "url_decode" line 13 at RETURN
----
On Fri, Dec 2, 2011 at 17:46, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> set client_encoding to latin1;
> postgres=# select unencode('Hell%C3%B6%20World%21');
> unencode
> ---------------
> Hellö World!
> (1 row)
Sorry, but AFAICT this makes a mess of encodings and only works by
pure luck. The server thinks it's sending the client LATIN1 text, but
it's actually UTF8-encoded and the last decoding step is done by your
terminal.
Regards,
Marti
From | Date | Subject | |
---|---|---|---|
Next Message | Merlin Moncure | 2011-12-02 16:58:59 | Re: How to convert HEX to ASCII? |
Previous Message | Merlin Moncure | 2011-12-02 15:46:13 | Re: How to convert HEX to ASCII? |