Re: How can I return a UTF8 string from a hex representation in a latin9 database?

From: Erik Wienhold <ewie(at)ewie(dot)name>
To: PGUser2020 <pg(at)diorite(dot)uk>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: How can I return a UTF8 string from a hex representation in a latin9 database?
Date: 2023-11-20 02:59:35
Message-ID: q3w55c2ihmd62x2kpdw6sreazrolt2hsi5t4yjd6myoqw5y6sj@iimlufjzgkhw
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 2023-11-19 21:37 +0100, PGUser2020 wrote:
> I have a database for a legacy application that requires an 8 bit
> database (i.e. the application itself won't function on a UTF8
> database).

Technically speaking, UTF-8 is an 8-bit encoding. But I guess that
application would then show mojibake if UTF-8 were stored.

> Looking at ways to extend the functionality to be able to handle a few
> specified fields in Unicode.
>
> Had the idea to store a UTF8 string as either hex pairs or Base64
> inside a VARCHAR field, which is fine. I can do that.

Do you have to use existing Latin-9 text columns to store UTF-8? If not
then I'd go with bytea instead of text (varchar) if possible and also
supported by your client. Otherwise it may be difficult to distinguish
between "normal" Latin-9 text and the hex- or base64-encoded UTF-8.
Although bytea could also store anything, not just UTF-8, so you'd have
to deal with invalid data anyway.

> What needs to happen though, is to build a view, that will return the
> decoded hex (or b64) as a UTF8 string to a client which has specified
> client encoding UTF8.

Is the same client sending and reading that data? If yes, why can't the
client do the hex-encoding of the UTF-8 string and only send/read those
encoded strings so that database won't event see UTF-8? Why must the
database be involved in this custom encoding scheme instead of just
storing BLOBs (either as bytea or some encoded text)?

> I've tried various combinations of convert_from, and convert_to, and
> convert, but I just can't seem to get it to return the string a UTF8
> select to the client.
>
> So if I have this data:
>
> select * from mytable; mycolumn
> ------------------------------------------
> ceb120ceb220ceb320ceb420ceb520cf83cf84
>
> Then:
>
> select convert_from(decode(mycolumn, 'hex')::bytea, 'utf-8') from mytable where usr='BATCH';
>
> ERROR: character with byte sequence 0xce 0xb1 in encoding "UTF8" has no equivalent in encoding "LATIN9"
>
> So the database encoding is still relevant , this is expected by the
> description of convert_from in the documentation of course.
>
> Is there some combination of functions I can use to have a client
> select this column from this table in a LATIN9 database and get a UTF8
> string back?

The client can disable encoding conversion by setting client_encoding to
sql_ascii:

latin9_test=# show server_encoding;
server_encoding
-----------------
LATIN9
(1 row)

latin9_test=# set client_encoding to sql_ascii;
SET
latin9_test=# show client_encoding;
client_encoding
-----------------
SQL_ASCII
(1 row)

latin9_test=# select convert_from(decode('ceb120ceb220ceb320ceb420ceb520cf83cf84', 'hex'), 'sql_ascii');
convert_from
---------------------
α β γ δ ε στ
(1 row)

Maybe that's also an option for your client.

--
Erik

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Matthias Apitz 2023-11-20 06:30:17 pg_basebackup
Previous Message PGUser2020 2023-11-19 20:37:45 How can I return a UTF8 string from a hex representation in a latin9 database?