From: | Amit Khandekar <amit(dot)khandekar(at)enterprisedb(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: pg_dump/restore encoding woes |
Date: | 2013-09-25 07:19:55 |
Message-ID: | CACoZds2K+imiraAZNx85eb4ibxnK2EJ3mNJJ=5RP=XAh2HaWGQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 27 August 2013 20:06, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:
> On 26.08.2013 18:59, Tom Lane wrote:
>
>> Heikki Linnakangas<hlinnakangas(at)**vmware(dot)com <hlinnakangas(at)vmware(dot)com>>
>> writes:
>>
>> The pg_dump -E option just sets client_encoding, but I think it would be
>>> better for -E to only set the encoding used in the dump, and
>>> PGCLIENTENCODING env variable (if set) was used to determine the
>>> encoding of the command-line arguments. Opinions?
>>>
>>
>> I think this is going to be a lot easier said than done, but feel free
>> to see if you can make it work. (As you point out, we don't have
>> any client-side encoding conversion infrastructure, but I don't see
>> how you're going to make this work without it.)
>>
>
> First set client_encoding to PGCLIENTENCODING (ie. let libpq do its
> default thing), and run the queries that fetch the OIDs of any matching
> tables. Only after doing that, set client_encoding to that specified by -E.
> That's actually pretty easy to do, as pg_dump already issues all the
> queries that include user-given strings first.
>
> There's one small wrinkle in that: if the dump fails because the server
> sends an error, the error would come from the server in the -E encoding,
> because that's used as the client_encoding after the initial queries.
> Logically, the errors should be printed in PGCLIENTENCODING. But I think we
> can live with that.
>
> The pg_restore part is more difficult, as pg_restore needs to work without
> a database connection at all. There the conversion has to be done
> client-side.
>
>
> A second issue is whether we should divorce -E and PGCLIENTENCODING like
>> that, when they have always meant the same thing. You mentioned the
>> alternative of looking at pg_dump's locale environment to determine the
>> command line encoding --- would that be better?
>>
>
> Hmm. I wasn't thinking of looking at the locale environment as an
> alternative to the divorce, just as a way to determine the default for the
> client encoding if PGCLIENTENCODING is not set.
>
> It would be good for pg_dump to be consistent with other tools (reindexdb
> etc.). If we say that LC_CTYPE determines the encoding of command-line
> options, regardless of PGCLIENTENCODING, we should do the same in other
> tools, too. And that would be weird for the other tools. So no, I don't
> think we should do that.
>
> My plan is to assume that the command-line options are in the "client
> encoding", and the client encoding is determined by the following things,
> in this order:
>
> 1. client_encoding setting given explicitly in the command line, as part
> of the connection string.
> 2. PGCLIENTENCODING
> 3. LC_CTYPE
> 4. same as server_encoding
>
> The encoding to be used in the pg_dump output is a different concept, and
> there are reasons to *not* want the dump to be in the client encoding.
> Hence using server_encoding for that would be a better default than the
> current client encoding. This isn't so visible today, as client_encoding
> defaults to server_encoding anyway, but it will be if we set
> client_encoding='auto' (ie. look at LC_CTYPE) by default.
>
> There are certainly cases where you'd want to use the client encoding as
> the dump output encoding. For example if you pipe the pg_dump output to
> some custom script that mangles it. Or if you just want to look at the
> output, ie. if the output goes to a terminal. But for the usual case of
> taking a backup, using the server encoding makes more sense. One option is
> to check if the output is a tty (like psql does), and output the dump in
> client or server encoding depending on that (if -E is not given).
>
> So yeah, I think we should divorce -E and PGCLIENTENCODING. It feels that
> using PGCLIENTENCODING to specify the output encoding was more accidental
> than on purpose. Looking at the history, the implementation has been that
> way forever, but the docs were adjusted by commit b524cb36 in 2005 to
> document that fact.
>
> Here is a set of three patches that I've been working on:
>
> 0001-Divorce-pg_dump-E-option-**from-PGCLIENTENCODING.patch
>
> Separates pg_dump -E from PGCLIENTENCODING.
>
Since AH->encoding is now used only to represent dump encoding, we should
rename it as AH->dump_encoding.
-----
The standard_conforming_strings parameter is currently set in
setup_connection. You have moved it in setup_dump_encoding(). Is it in any
way related to encoding ? If not, I think we should keep it in
setup_connection().
-----
If a user supplies pg_dump --role=ROLENAME, we do the SET-ROLE in
setup_connection. The role name is in client encoding. In case of parallel
pg_dump, the setup_connection() in the parent process does it right, but in
the worker processes, the client_encoding is already set to the
dump_encoding when -E is given (parent has already updated it's
client_encoding to dump_encoding and I think the child inherits
client_encoding from parent), and then in the child when SET-ROLE is called
through setup_connection, it does not work because role name is in client
encoding, not dump encoding.
$ pg_dump -t "pöö" -E LATIN1 --role=uöö postgres
# Above succeeds
$ `which pg_dump` -t "pöö" -E LATIN1 -j 5 -f dumpdir -Fd postgres
--role=uöö
pg_dump: [archiver (db)] query failed: ERROR: role "uöö" does not exist
pg_dump: [parallel archiver] query was: SET ROLE "uöö"
pg_dump: [archiver (db)] query failed: ERROR: role "uöö" does not exist
pg_dump: [archiver (db)] query failed: ERROR: role "uöö" does not exist
pg_dump: [archiver (db)] query failed: ERROR: role "uöö" does not exist
pg_dump: [archiver (db)] query failed: ERROR: role "uöö" does not exist
>
> 0002-Set-client_encoding-auto-**in-all-client-utilities.patch
> Set client_encoding='auto' in all the client utilities, like we already
> did in psql. This fixes e.g "reindexdb -t" with a table name with non-ASCII
> chars
>
>
This patch looks good, I don't have any issues with this.
> 0003-Convert-object-names-to-**archive-encoding-before-matc.**patch
>
> Use iconv(3) in pg_restore to do encoding conversion in the client. This
> involves a lot of autoconf changes that I'm not 100% sure about, other than
> that it's pretty straightforward.
I haven't looked into this one yet.
>
>
> - Heikki
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>
From | Date | Subject | |
---|---|---|---|
Next Message | Marc Mamin | 2013-09-25 07:33:02 | invalid regexp crashes the server on windows or 9.3 |
Previous Message | Erik Rijkers | 2013-09-25 06:39:55 | Re: Minmax indexes |