Re: pgsql-bugs mailing list dump?

From: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: PostgreSQL WWW <pgsql-www(at)postgresql(dot)org>
Subject: Re: pgsql-bugs mailing list dump?
Date: 2020-12-23 22:54:00
Message-ID: 20201223235400.226428d8@firost
Lists: pgsql-www

On Tue, 22 Dec 2020 11:11:10 +0100
Magnus Hagander <magnus(at)hagander(dot)net> wrote:

> On Wed, Dec 16, 2020 at 3:53 PM Jehan-Guillaume de Rorthais
> <jgdr(at)dalibo(dot)com> wrote:
> >
> > Hello Magnus,
> >
> > On Wed, 16 Dec 2020 15:02:03 +0100
> > Magnus Hagander <magnus(at)hagander(dot)net> wrote:
> >
> > > On Wed, Dec 16, 2020 at 2:57 PM Jehan-Guillaume de Rorthais
> > > <jgdr(at)dalibo(dot)com> wrote:
> > [...]
> > > > However, maybe some admins would agree to provide some pgsql dump or
> > > > access to some json API if relevant? We would save some time and
> > > > CPU :)
> > >
> > > There are mbox files available for download from the list archives --
> > > would that work for you? It can be done on a per-thread basis as well,
> > > i guess, but that's not something we have now (that is, we don't have
> > > a unique listing of threads).
> >
> > The srht import API processes one JSON document per thread. That's why we
> > try to gather one mbox per thread.
>
> There must be something I'm missing here, because that sounds.. Insane?
>
> Basically they take a raw mbox and wrap it in json? Just to make it
> less efficient?
>
> And they specifically need the "outside" to have done the one thing
> that's actually hard, namely threading?
>
> What are they actually trying to accomplish here?

That would indeed be perfectly insane :) Such a scheme would be a dead end
right from the start.

No, the sr.ht import script accepts a pure JSON document *only*. They do not
require you to wrap an mbox in JSON: the whole thread must be expressed as JSON
following their **import/export** format.

When downloading mbox files from postgresql.org, we have to reinvent the wheel
and transform the mbox into JSON ourselves.
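For illustration, the wheel-reinventing involved looks roughly like this sketch using Python's `mailbox` module. The threading logic and the output fields are placeholders of ours, not sr.ht's actual import/export schema:

```python
import json
import mailbox
from collections import defaultdict

def thread_mbox(path):
    """Group the messages of a monthly mbox file into threads,
    keyed by the Message-ID of each thread's root message."""
    msgs = list(mailbox.mbox(path))

    # message-id -> parent message-id, from In-Reply-To headers
    parent = {}
    for msg in msgs:
        mid = (msg.get("Message-ID") or "").strip()
        ref = (msg.get("In-Reply-To") or "").strip()
        if mid and ref:
            parent[mid] = ref

    def root_of(mid):
        # walk up to the thread root, guarding against reference cycles
        seen = set()
        while mid in parent and mid not in seen:
            seen.add(mid)
            mid = parent[mid]
        return mid

    threads = defaultdict(list)
    for msg in msgs:
        mid = (msg.get("Message-ID") or "").strip()
        threads[root_of(mid)].append({
            "message_id": mid,
            "subject": msg.get("Subject", ""),
            "from": msg.get("From", ""),
        })
    return dict(threads)

def dump_threads(threads):
    # one JSON document per thread (the schema here is a placeholder)
    return {root: json.dumps({"root": root, "messages": msgs})
            for root, msgs in threads.items()}
```

A real converter would also chase the `References` header as a fallback, which is where most of the orphan trouble comes from.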

Note that in production, the bug tracker relies on a mailing list managed by
sr.ht. Each mail is parsed and stored in pgsql.

> > > But if you're building your own threading on it, then the monthly mbox
> > > files at https://www.postgresql.org/list/pgsql-bugs/ should be enough?
> >
> > Yes, we already got them to start poking around. We have a small
> > python script processing them, but the mbox format and/or the python lib
> > and/or the email format are a bit loose and we currently have 3k orphan
> > emails out of 13697 threads.
>
> Oh, there is a lot of weirdness in the email archives, particularly in
> history (it's gotten a bit better, but we still see really weird mime
> combinations fairly often). And there have been many crappy
> implementations of mbox over the years as well, which has led to a lot
> of problems with imports :/

Indeed. But anyway, my colleague's script is already able to sort out most of
these issues. Good enough for now.

> So the root question there is, why are we extracting more structured
> data into a format that we know is worse?

The root question was me asking whether a database dump or access to some JSON
API would somehow be possible. I should have explained right away that this was
to extract the data from there as JSON.

My bad, really. I hope the whole picture is clearer now.

> > BTW, we found some orphan emails in the pgarchiver UI as well that might be
> > fixed if you are interested. The in-reply-to field is malformed but a
> > message-id is still available there, eg:
> > https://postgr.es/m/4454.935677480%40sss.pgh.pa.us.
>
> I'm not sure we want to go down the route of manually editing
> messages. It would work for a message like this from 1999 because
> that's before DKIM, which would prevent us from doing it at all. But
> either way the archives should represent what things actually looked
> like as much as possible. And from an archives perspective that is not
> an orphaned thread, that is a single message sent on its own thread
> (and we have plenty of those in general).

Sure.
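For our side of the import, a malformed in-reply-to like the one mentioned above can often still be salvaged client-side without touching the archives. A minimal sketch; the regex and the helper name are ours, not part of any existing tool:

```python
import re

# A malformed In-Reply-To header sometimes still embeds something
# shaped like a message-id: grab the first addr-spec-looking token,
# with or without its angle brackets.
MSGID_RE = re.compile(r"<?([^<>\s]+@[^<>\s]+)>?")

def salvage_message_id(header_value):
    """Return a normalized '<id@host>' parent reference, or None."""
    m = MSGID_RE.search(header_value or "")
    return f"<{m.group(1)}>" if m else None
```

This only recovers a parent reference for our own threading pass; it does not pretend to repair the original header.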

> > Without any better solution, maybe our current method is "good enough" for a
> > simple PoC. We could tighten/rewrite this part of the procedure in a second
> > round if it's worth it.
>
> Probably.
>
> But if you are somehow crawling the per-thread mbox urls please make
> sure you rate limit yourself severely. They're really not meant to be
> API endpoints...

As far as I know, we now have enough data to move ahead. We should not need to
crawl again any time soon. We will add some rate limiting if needed in the
future, but I hope we will not have to deal with mbox anymore.

Thanks!
