Re: Corrupted subjects on the archive website

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL WWW <pgsql-www(at)postgresql(dot)org>
Subject: Re: Corrupted subjects on the archive website
Date: 2015-09-29 10:36:21
Message-ID: CABUevExjM5Uo52XxNHLz9DmyFoYgWWJJcm7s5oWu_CXeuB2c6A@mail.gmail.com
Lists: pgsql-www

On Wed, Sep 23, 2015 at 7:30 PM, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> wrote:

> On 09/23/2015 06:59 AM, Thomas Munro wrote:
> > Hi
> >
> > Why do some messages display with corrupted subjects on the mailing
> > list archives site? The replies to the message below, but not the
> > message itself, are displayed with a corrupted subject. They appear
> > fine in my mail client though.
> >
> >
> > http://www.postgresql.org/message-id/20150922134404.5050.75087@wrigleys.postgresql.org
> >
> > The website shows "Re: [BUGS] BUG #13632: violation de l'intégrité rQ1|
> > ɕѥ".
> > My mail client shows "Re: [BUGS] BUG #13632: violation de l'intégrité
> > référentielle".
> >
> > The original message that displays correctly has the following raw header:
> >
> > Subject: =?utf-8?b?QlVHICMxMzYzMjogdmlvbGF0aW9uIGRlIGwnaW50w6lncml0w6kgcsOp?=
> > =?utf-8?q?f=C3=A9rentielle?=
> >
> > The reply that doesn't display correctly has the following raw header:
> >
> > Subject: =?UTF-8?B?UmU6IFtCVUdTXSBCVUcgIzEzNjMyOiB2aW9sYXRpb24gZGUgbCdpbnTDqWdyaXTDqSBy?=
> > =?UTF-8?B?w6lmw6lyZW50aWVsbGU=?=
> >
> > A wise denizen of #postgresql pointed out that 'UTF-8' decoded as
> > base64 produces 'Q1\377' of which we see at least the 'Q1' in the
> > corrupted string.
>
> I looked a bit at the code and did some testing - the difference between
> the original mail (which is stored and displayed correctly in the
> archives database) and the two replies that have it corrupted is how the
> line wrapping for the Subject is done (basically linebreak + space in the
> first version and linebreak + tab in the broken one).
>
> We use decode_header() from the python email package to parse headers
> and it is actually capable of correctly decoding both variants.
> However there is a special hack in our importer code citing
> http://bugs.python.org/issue504152 that removes \n\t unconditionally
> from the raw string.
> I don't know the details of why that was put in originally but that
> surely must be wrong in general because it removes the required
> separation between different header words through linear whitespace
> per RFC 2047 (because in this case it leaves no separation at all, causing
> decode_header() to go haywire).
> I think it was Magnus who put that special case in so maybe he can shed
> some light on the issue this change was targeted at?
>
>

He cannot, unfortunately. That was years ago and I don't have a test case
around for it.

As discussed with Stefan, we need to set up a proper test bench to make sure
we don't break something else when/if we remove this change. It's on my
TODO list, and I just wanted to ack that in this thread. This is clearly a
bug in the archives code that needs to get fixed; it'll just take a bit
longer, as we don't yet have a way to test across the 1M+ messages
currently in the archives.
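
For reference, a minimal Python 3 sketch of the behaviour described in the
quoted analysis. It is not the archive importer's code: decode_subject() is a
hypothetical helper, and the folding whitespace (newline + space for the
original, newline + tab for the reply) is reconstructed from Stefan's
description rather than copied from the raw messages. The last line assumes a
lenient base64 decoder that skips the '-' in "UTF-8", which would explain the
"Q1|" seen in the corrupted subject.

```python
import base64
from email.header import decode_header

# Raw Subject values from the thread. The folding whitespace is an assumption
# based on the description above: space for the original, tab for the reply.
ORIGINAL = (
    "=?utf-8?b?QlVHICMxMzYzMjogdmlvbGF0aW9uIGRlIGwnaW50w6lncml0w6kgcsOp?=\n"
    " =?utf-8?q?f=C3=A9rentielle?="
)
REPLY = (
    "=?UTF-8?B?UmU6IFtCVUdTXSBCVUcgIzEzNjMyOiB2aW9sYXRpb24gZGUgbCdpbnTDqWdyaXTDqSBy?=\n"
    "\t=?UTF-8?B?w6lmw6lyZW50aWVsbGU=?="
)


def decode_subject(raw):
    """Decode an RFC 2047 header value to str (hypothetical helper)."""
    parts = []
    for value, charset in decode_header(raw):
        # decode_header() yields bytes for encoded-words, with their charset.
        if isinstance(value, bytes):
            value = value.decode(charset or "ascii", errors="replace")
        parts.append(value)
    return "".join(parts)


# With the folding whitespace left in place, both variants decode cleanly.
original_subject = decode_subject(ORIGINAL)
reply_subject = decode_subject(REPLY)

# Stripping "\n\t" unconditionally glues the two encoded-words together,
# leaving "?==?" with no separating whitespace between them.
glued = REPLY.replace("\n\t", "")

# The charset label of the second word, read as base64 with '-' skipped.
artifact = base64.b64decode("UTF8")
```

A regression test over the archive could compare decode_subject() on the
stored raw header against the subject currently shown on the website and
report every message where the two differ.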

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
