Re: Encoding problems in PostgreSQL with XML data

From: Hannu Krosing <hannu(at)tm(dot)ee>
To: Merlin Moncure <merlin(dot)moncure(at)rcsonline(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject: Re: Encoding problems in PostgreSQL with XML data
Date: 2004-01-15 20:46:32
Message-ID: 1074199591.3292.12.camel@fuji.krosing.net
Lists: pgsql-hackers

Merlin Moncure wrote on Thu, 15.01.2004 at 18:43:
> Hannu Krosing wrote:

> > select
> > '<d/>'::xml == '<?xml version="1.0" encoding="utf-8"?>\n<d/>\n'::xml
>
> Right: I understand your reasoning here. Here is the trick:
>
> select '[...]'::xml introduces a casting step which justifies a
> transformation. The original input data is not xml, but varchar. Since
> there are no arbitrary rules on how to do this, we have some flexibility
> here to do things like change the encoding/mess with the whitespace. I
> am trying to find a way to break the assumption that my xml data
> necessarily has to be converted from raw text.
>
> My basic point is that we are confusing the roles of storing and
> parsing/transformation. The question is: are we storing xml documents
> or the metadata that makes up xml documents? We need to be absolutely
> clear on which role the server takes on...in fact both roles may be
> appropriate for different situations, but should be represented by a
> different type. I'll try and give examples of both situations.
>
> If we are strictly storing documents, IMO the server should perform zero
> modification on the document. Validation could be applied conceptually
> as a constraint (and, possibly XSLT/XPATH to allow a fancy type of
> indexing). However there is no advantage that I can see to manipulating
> the document except to break the 'C' of ACID. My earlier point about
> binary encoding is that there simply has to be a way to prevent the
> server from mucking with my document.
>
> For example, if I was using postgres to store XML-EDI documents in a DX
> system this is the role I would prefer. Validation and indexing are
> useful, but my expected use of the server is a type of electronic xerox
> of the incoming document. I would be highly suspicious of any
> modification the server made to my document for any reason.

The current charset/encoding support can be evil in some cases ;(

The only solution seems to be keeping both server and client encoding as
ASCII (or just disabling encoding support altogether).

The proper approach to encodings must, unfortunately, do the encoding
conversions *after* parsing, when it is known which parts of the
original query string should be changed.

Or, as you suggested, always encode anything outside plain ASCII (n<32
or n>127), both on input (can be done client-side) and output (IIRC this
needs another type with a different output function).

> Based on your suggestions I think you are primarily concerned with the
> second example. However, in my work I do a lot of DX and I see the xml
> document as a binary object. Server-side validation would be extremely
> helpful, but please don't change my document!

So the problem is not XML as such, but rather encoding conversion being
applied to "binary" strings that should not be changed.

I hope (but I'm not sure) that keeping the client and server encodings
the same prevents that.
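
To make the failure mode concrete, here is a small Python sketch that only simulates a client-to-server conversion step. It does not talk to PostgreSQL; `transcode` is a hypothetical stand-in for whatever conversion the server applies, and the sample document is invented for the example.

```python
import xml.etree.ElementTree as ET

# A document as the sender produced it: Latin-1 bytes with a matching
# encoding declaration. "\u00f5" is o-tilde.
doc_bytes = (
    '<?xml version="1.0" encoding="iso-8859-1"?><d>\u00f5</d>'
).encode("latin-1")


def transcode(data: bytes, client_encoding: str, server_encoding: str) -> bytes:
    """Hypothetical stand-in for a client/server encoding conversion."""
    return data.decode(client_encoding).encode(server_encoding)


# Same encoding on both sides: the bytes come through unchanged.
stored_same = transcode(doc_bytes, "latin-1", "latin-1")

# Different encodings: the payload is rewritten as UTF-8, but the
# declaration inside the document still claims iso-8859-1.
stored_diff = transcode(doc_bytes, "latin-1", "utf-8")

# A parser that trusts the declaration now misreads the content.
text_same = ET.fromstring(stored_same).text
text_diff = ET.fromstring(stored_diff).text
```

In this simulation the document survives byte for byte only when both sides use the same encoding; with differing encodings the stored bytes no longer agree with the document's own declaration.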

> So, I submit that we are both right for different reasons.

Seems so.

-----------------
Hannu
