Re: Post-2018 messages in archives

From: Noah Misch <noah(at)leadboat(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: PostgreSQL WWW <pgsql-www(at)postgresql(dot)org>
Subject: Re: Post-2018 messages in archives
Date: 2018-12-06 04:27:16
Message-ID: 20181206042716.GB2945370@rfd.leadboat.com
Lists: pgsql-www

On Wed, Dec 05, 2018 at 09:39:18AM +0100, Magnus Hagander wrote:
> On Wed, Dec 5, 2018 at 2:53 AM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > On Mon, Dec 03, 2018 at 10:08:20AM +0100, Magnus Hagander wrote:
> > > On Mon, Dec 3, 2018 at 2:40 AM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > > > At some point in the last few months, the archives of many mailing lists
> > > > added messages dated far in the future. For example, pgsql-hackers
> > > > archives gained four messages from years 2030, 2032 and 2036:
> > > >
> > > > https://www.postgresql.org/list/pgsql-hackers/since/203011010000/
> >
> > > > Perhaps the fix is to set the archive date to the archives ingest time
> > > > when the message asserts a date substantially (15min?) earlier or later.
> > > > Would that be an improvement?
> >
> > > Unfortunately we don't keep the ingest time separately. But for the future,
> > > doing so would probably be a good idea, for other reasons as well. I think
> > > 15 minutes might be pushing it a bit given the kind of times we see around,
> > > in particular with incorrectly configured timezones. But something like 24h
> > > would probably work.
> > >
> > > Luckily, it's not too terribly bad:
> > >
> > > archives=# select count(*) from messages where date > now();
> > >  count
> > > -------
> > >     10
> > > (1 row)
> > >
> > > (out of about 1.3M messages).
> > >
> > > So short-term I will go process those messages manually.
> >
> > Data looks clean now. Thanks. If the problem remains as rare as it has been,
> > the automated fix I was contemplating is premature.
> >
>
> Thanks for confirming.
>
> I think it's still needed, in case either (1) it happens again, or (2) we
> reparse the archives fully again which will reset it all. It's not too
> urgent at this point though, but I've left it on my TODO list to make sure
> we have a safeguard in there.

Works for me. Pondering it more, the timestamp that matters most for archive
purposes is the timestamp at which list subscribers started to receive their
copies of the message. Based on that, I'm thinking we should ignore the Date
header and always use the timestamp from a particular "Received ... by
HOSTNAME.postgresql.org" header. Before settling on that, I'd want to check
how many messages change timestamp by more than ~100s, and I'd want to spot
check a few messages to see whether the change looks like an improvement.
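
A rough sketch of the check described above, in Python, in case it helps to
make the idea concrete. This is not the archive's actual code. The function
names, the regular expression, and the choice of the topmost matching
Received header are all assumptions made for illustration; which header to
trust is exactly the part that would need the spot check.

```python
import re
from datetime import timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Assumption: any Received header stamped "by <something>.postgresql.org".
BY_PG_HOST = re.compile(r"\bby\s+[\w.-]+\.postgresql\.org\b", re.IGNORECASE)


def _as_utc(dt):
    """Treat a naive datetime (e.g. from a "-0000" zone) as UTC."""
    return dt.replace(tzinfo=timezone.utc) if dt.tzinfo is None else dt


def received_timestamp(msg):
    """Timestamp of the topmost matching Received header, or None.

    Headers are prepended in transit, so the topmost match is the latest
    hop through a *.postgresql.org host.
    """
    for value in msg.get_all("Received", []):
        if not BY_PG_HOST.search(value):
            continue
        # The date-time of a Received header follows its last ';'.
        _, sep, stamp = value.rpartition(";")
        if not sep:
            continue
        try:
            return parsedate_to_datetime(" ".join(stamp.split()))
        except (TypeError, ValueError):
            continue
    return None


def timestamp_shift(raw_message):
    """Received time minus Date header time, or None if either is missing."""
    msg = message_from_string(raw_message)
    received = received_timestamp(msg)
    if received is None:
        return None
    try:
        claimed = parsedate_to_datetime(msg["Date"])
    except (TypeError, ValueError):
        return None
    return _as_utc(received) - _as_utc(claimed)


def count_large_shifts(raw_messages, threshold=timedelta(seconds=100)):
    """Count messages whose timestamp would move by more than threshold."""
    shifts = (timestamp_shift(m) for m in raw_messages)
    return sum(1 for s in shifts if s is not None and abs(s) > threshold)
```

Running count_large_shifts() over the raw messages would give the "how many
change by more than ~100s" number; messages where timestamp_shift() returns
None have no usable Received header under this pattern and would need a
fallback.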
