From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL WWW <pgsql-www(at)postgresql(dot)org> |
Subject: | Re: Corrupted subjects on the archive website |
Date: | 2015-09-29 10:36:21 |
Message-ID: | CABUevExjM5Uo52XxNHLz9DmyFoYgWWJJcm7s5oWu_CXeuB2c6A@mail.gmail.com |
Lists: | pgsql-www |
On Wed, Sep 23, 2015 at 7:30 PM, Stefan Kaltenbrunner <
stefan(at)kaltenbrunner(dot)cc> wrote:
> On 09/23/2015 06:59 AM, Thomas Munro wrote:
> > Hi
> >
> > Why do some messages display with corrupted subjects on the mailing
> > list archives site? The replies to the message below, but not the
> > message itself, are displayed with a corrupted subject. They appear
> > fine in my mail client though.
> >
> >
> > http://www.postgresql.org/message-id/20150922134404.5050.75087@wrigleys.postgresql.org
> >
> > The website shows "Re: [BUGS] BUG #13632: violation de l'intégrité rQ1|
> ɕѥ".
> > My mail client shows "Re: [BUGS] BUG #13632: violation de l'intégrité
> > référentielle".
> >
> > The original message that displays correctly has the following raw
> > header:
> >
> > Subject:
> =?utf-8?b?QlVHICMxMzYzMjogdmlvbGF0aW9uIGRlIGwnaW50w6lncml0w6kgcsOp?=
> > =?utf-8?q?f=C3=A9rentielle?=
> >
> > The reply that doesn't display correctly has the following raw header:
> >
> > Subject:
> =?UTF-8?B?UmU6IFtCVUdTXSBCVUcgIzEzNjMyOiB2aW9sYXRpb24gZGUgbCdpbnTDqWdyaXTDqSBy?=
> > =?UTF-8?B?w6lmw6lyZW50aWVsbGU=?=
> >
> > A wise denizen of #postgresql pointed out that 'UTF-8' decoded as
> > base64 produces 'Q1\377' of which we see at least the 'Q1' in the
> > corrupted string.
>
> I looked a bit at the code and did some testing - the difference between
> the original mail (which is stored and displayed correctly in the
> archives database) and the two replies that have it corrupted is how the
> line wrapping for the Subject is done (basically linebreak + space in the
> first version and linebreak + tab in the broken one).
>
> We use decode_header() from the Python email package to parse headers
> and it is actually capable of correctly decoding both variants.
> However there is a special hack in our importer code citing
> http://bugs.python.org/issue504152 that removes \n\t unconditionally
> from the raw string.
> I don't know the details of why that was put in originally but that
> surely must be wrong in general because it removes the required
> separation between different header words through a linear whitespace
> per RFC 2047 (because in this case it leaves no separation at all causing
> decode_header() to go haywire).
> I think it was Magnus who put that special case in so maybe he can shed
> some light on the issue this change was targeted at?
>
>
He cannot, unfortunately. That was years ago and I don't have a test case
around for it.
As discussed with Stefan, we need to set up a proper testbench to make sure
we don't break something else when/if we remove this change. It's on my
TODO list, and I just wanted to ack that in this thread. This is clearly a
bug in the archives code that needs to get fixed; it'll just take a bit
longer, as we don't currently have a way to test across the 1M+ messages
that are in the archives today.
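
For reference, the two observations quoted above can be checked with a
short script. This is only a sketch, assuming a current Python 3 and its
standard email package; it is not the importer code, and the helper name
decode_subject is made up for illustration. It decodes the raw Subject of
the affected reply with the newline + tab fold left in place, and then
shows what the letters of the charset label turn into when they are read
as base64 data (the hyphen is left out because it is not part of the
standard base64 alphabet).

```python
import base64
from email.header import decode_header

# Raw Subject of the affected reply as quoted in this thread, with the two
# encoded words separated by newline + tab (the fold the importer strips).
RAW_SUBJECT = (
    "=?UTF-8?B?UmU6IFtCVUdTXSBCVUcgIzEzNjMyOiB2aW9sYXRpb24gZGUgbCdpbnTDqWdyaXTDqSBy?="
    "\n\t"
    "=?UTF-8?B?w6lmw6lyZW50aWVsbGU=?="
)


def decode_subject(raw):
    """Decode an RFC 2047 header value into a str (hypothetical helper)."""
    parts = []
    for value, charset in decode_header(raw):
        if isinstance(value, bytes):
            # Encoded words come back as bytes plus the declared charset.
            parts.append(value.decode(charset or "ascii", errors="replace"))
        else:
            # Headers without any encoded word come back as plain str.
            parts.append(value)
    return "".join(parts)


# With the fold intact, the two encoded words are decoded and joined.
subject = decode_subject(RAW_SUBJECT)
print(subject)

# The letters of the charset label, misread as base64 payload.
stray = base64.b64decode("UTF8")
print(stray)
```

The second result matches the "Q1|" visible in the corrupted subject on the
website, which is consistent with the second word's "=?UTF-8?B?" prefix
having been treated as part of the first word's payload once the
separating whitespace was gone.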
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/