From: | Francisco Olarte <folarte(at)peoplecall(dot)com> |
---|---|
To: | Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com> |
Cc: | Yuriy Rusinov <yrusinov(at)gmail(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Convert from hex to string |
Date: | 2015-11-26 07:12:59 |
Message-ID: | CA+bJJbxh_NU23pm4ha=0T6_54zXd0BCPX+hyLaDYnFichkrixQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Hi Adrian:
On Wed, Nov 25, 2015 at 9:33 PM, Adrian Klaver
<adrian(dot)klaver(at)aklaver(dot)com> wrote:
> I will grant you that working with encodings is like working with
> timestamps, explicit is better. The thing I am having a problem with is how
> not knowing the context of the bytea value is different in the implicit vs
> explicit case:
> Granted the above is contrived and bound to fail, but the point is you need
> to know what created the bytea however it got there. Now if you are in
> charge of both ends of the process, then the above is your own fault.
> Otherwise, you are down to detective work on what encoding was used whether
> it was implicit or explicit. As the OP was working in a single context I am
> not seeing the issue in making use of that context to do the heavy lifting.
> For the use cases that you show I agree that a defined
> convert_to/encode/decode/convert_from chain is a best practice and something
> I had not really thought out, so thanks.
Well, I tend to think as a programmer. So I was thinking of the
scenario where you want to get some text from the database and move it
around, and you control all the code. What I was trying to point is
that using explicit all the character sets that matter are in the
database, so I can move it around freely, as I'm the one defining the
queries, while in the implicit case I have to know, or get from the
catalogs, the encoding of the database. I do not know what context the
OP was working and wanted to point he was mixing types. Postgres has a
lot of them, specially to/from text, and I've found the hard way that
lots of implicit conversions are great for one shot programs or
interactive tests, but relying on implicit type conversions for real
production code, put in a source, causes a lot of problems. In a
single session case you can even use implicit conversion + encode and
then paste the result into a convert_from adding quotes and x and it's
going to work, but if you write down that in code you are going to be
confused if something fails later, things like:
cdrs=> select encode('Año','hex');
encode
----------
41c3b16f
cdrs=> select convert_from('\x41c3b16f', 'UTF-8');
convert_from
--------------
Año
(1 row)
seem like you are converting back and forth, but then:
cdrs=> select convert_from(encode('Año','hex'), 'UTF-8');
ERROR: function convert_from(text, unknown) does not exist
LINE 1: select convert_from(encode('Año','hex'), 'UTF-8');
And also, the encode() example will give different results depending
on database encoding. Using explicit charsets and correct types frees
me from it. After all, to make the full round trip a
covert-to+encode+decode+convert-from is needed, and making it explciti
makes things easier.
After all, not everybody knows that all values in the database pass
through a conversion to/from text to be sent to / received from psql,
and that you can use it if you just want the hex digits in the current
database encoding:
cdrs=> select 'Año'::bytea;
bytea
------------
\x41c3b16f
(1 row)
But I doubt using cast instead of encode can be recomended.
And for the heavy lifting, you are just saving some keystrokes, which
IMO is a cheap price to pay for having a nearly self documenting
conversion chain.
Anyway, I think people should be told to respect types, and people
should be teach that strings are sequences of characters, that to do
what people think is 'hex encoding' ( two digits per *byte*, no
delimiter ), you need to first transform the string to bytes, then
hex-encode that. Nearly all the encoding problems I see is because
people thing strings are sequences of bytes, which they ceased to be
when multibyte encodings where detected ( and even without them, in
Java and I think some of the windows NT API Strings where sequences of
16 bits thingos, first UCS-2. then UTF16. Once people stop trying to
encode/decode strings directly normally they problems vanish.
Francisco Olarte.
From | Date | Subject | |
---|---|---|---|
Next Message | George Neuner | 2015-11-26 07:48:20 | Re: using a postgres table as a multi-writer multi-updater queue |
Previous Message | Jeff Janes | 2015-11-26 07:07:45 | Re: using a postgres table as a multi-writer multi-updater queue |