Re: How to convert HEX to ASCII?

From: Marti Raudsepp <marti(at)juffo(dot)org>
To: Torsten Zuehlsdorff <foo(at)meisterderspiele(dot)de>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: How to convert HEX to ASCII?
Date: 2011-12-02 16:42:01
Message-ID: CABRT9RDai1xCvvJB28ycA=jF43vO1rANLjbq-qNBc0YrKSHBAw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff
<foo(at)meisterderspiele(dot)de> wrote:
> But i clearly have a missunderstanding of other chars, like umlauts or utf-8
> chars. This, for example, should return a 'ö':
>
> # SELECT chr(x'C3B6'::int);
> chr
> -----
> 쎶
> (1 row)

That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded
and actually decodes to the codepoint 00F6.

There is a fundamental problem that a decoded URL may actually be a
binary string -- it might not have a textual representation at all.
But if text is what you want, RFC3986 strongly suggests using UTF-8
for encoding text strings in URLs, and that works almost always in the
real world.

So the *right* way is to first convert the URL to a binary "bytea"
type by fixing all the % escapes, then convert that to UTF-8 encoding
to handle multibyte characters.

What I came up with is far from elegant because PostgreSQL lacks
convenient functions for bytea manipulation (no bytea_agg, etc).
Stealing a little from Merlin, this is what it looks like:

CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
bin bytea = '';
byte text;
BEGIN
FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP
IF length(byte) = 3 THEN
bin = bin || decode(substring(byte, 2, 2), 'hex');
ELSE
bin = bin || byte::bytea;
END IF;
END LOOP;
RETURN convert_from(bin, 'utf8');
END
$$;

db=# select url_decode('Hell%C3%B6%20World%21');
url_decode
--------------
Hellö World!

db=# select url_decode('%EC%8E%B6');
url_decode
------------

This will break for binary-encoded data in URLs, though.
db=# select url_decode('%fa%fa%fa');
ERROR: invalid byte sequence for encoding "UTF8": 0xfa
CONTEXT: PL/pgSQL function "url_decode" line 13 at RETURN

----

On Fri, Dec 2, 2011 at 17:46, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> set client_encoding to latin1;

> postgres=# select unencode('Hell%C3%B6%20World%21');
> unencode
> ---------------
> Hellö World!
> (1 row)

Sorry, but AFAICT this makes a mess of encodings and only works by
pure luck. The server thinks it's sending the client LATIN1 text, but
it's actually UTF8-encoded and the last decoding step is done by your
terminal.

Regards,
Marti

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Merlin Moncure 2011-12-02 16:58:59 Re: How to convert HEX to ASCII?
Previous Message Merlin Moncure 2011-12-02 15:46:13 Re: How to convert HEX to ASCII?