From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Matteo Beccati <php(at)beccati(dot)com> |
Cc: | Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Dave Page <dpage(at)pgadmin(dot)org>, Abhijit Menon-Sen <ams(at)toroid(dot)org>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Tim Bunce <Tim(dot)Bunce(at)pobox(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: mailing list archiver chewing patches |
Date: | 2010-01-12 20:04:27 |
Message-ID: | 9837222c1001121204m52adbf21k6260609f1c8768b@mail.gmail.com |
Lists: | pgsql-hackers pgsql-www |
On Tue, Jan 12, 2010 at 20:56, Matteo Beccati <php(at)beccati(dot)com> wrote:
> On 12/01/2010 10:30, Magnus Hagander wrote:
>>
>> The problem is usually with strange looking emails with 15 different
>> MIME types. If we can figure out the proper way to render that, the
>> rest really is just a SMOP.
>
> Yeah, I was expecting some, but all the messages I've looked at seemed to be
> working OK.
Have you been looking at old or new messages? Try grabbing a couple of
MBOX files off archives.postgresql.org from several years back; you're
more likely to find weird MUAs then, I think.
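One quick way to spot the strange messages in an old MBOX file is to count the MIME types of every part. This is only a rough sketch using Python's standard mailbox module; the function name mime_type_counts and the file path in the usage note are made up for illustration, and it has nothing to do with how AOX itself parses mail.

```python
import mailbox
from collections import Counter


def mime_type_counts(mbox_path):
    """Count the MIME types of every part of every message in an mbox file.

    Hypothetical helper, not part of any archive tooling.
    """
    counts = Counter()
    # create=False makes a missing file an error instead of creating it
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            # walk() yields the message itself, then each nested MIME part
            for part in message.walk():
                counts[part.get_content_type()] += 1
    finally:
        box.close()
    return counts
```

Running something like mime_type_counts("pgsql-hackers.2003-01") (a placeholder file name) and sorting by the rarest types should point at the messages most likely to render badly.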
>> (BTW, for something to actually be used In Production (TM), we want
>> something that uses one of our existing frameworks. So don't go
>> overboard in code-wise implementations on something else - proof of
>> concept on something else is always ok, of course)
>
> OK, that's something I didn't know, even though I expected some kind of
> limitations. Could you please elaborate a bit more (i.e. where to find
> info)?
Well, the framework we're moving towards is built on top of django, so
that would be a good first start.
There is also whatever the commitfest thing is built on, but I'm told
that's basically no framework.
> Having played with it, here's my feedback about AOX:
>
> pros:
> - seemed to be working reliably;
> - does most of the dirty job of parsing emails, splitting parts, etc
> - highly normalized schema
> - thread support (partial?)
It will be a killer feature if that thread support is enough. If we have to build
that completely ourselves, it'll take a lot more work.
> cons:
> - directly publishing the live email feed might not be desirable
Why not?
> - queries might end up being a bit complicated for simple tasks
As long as we don't have to hit them too often, which is solvable
with caching. And we do have a pretty good RDBMS to run the queries on
:)
>> I don't think you can trust the NNTP gateway now or in the past,
>> messages are sometimes lost there. The mbox files are as complete as
>> anything we'll ever get.
>
> Importing the whole pgsql-www archive with a perl script that bounces
> messages via SMTP took about 30m. Maybe there's even a way to skip SMTP, I
> haven't looked into it that much.
Um, yes. There is an MBOX import tool.
>>>> - We need to generate thread indexes
>>>
>>> We have CTEs :)
>>
>> Right. We still need the threading information, so we have something
>> to use our CTEs on :-)
>>
>> But I assume that AOX already does this?
>
> there are thread related tables and they seem to get filled when a SORT IMAP
> command is issued, however I haven't found a way to get the hierarchy out of
> them.
>
> What that means is that we'd need some kind of post processing to populate a
> thread hierarchy.
>
> If there isn't a fully usable thread hierarchy, I was thinking more of ltree,
> mainly because I've successfully used it in the past and I haven't had enough
> time yet to look at CTEs. But if performance is comparable I don't see a
> reason why we shouldn't use them.
I'd favor CTEs if they are fast enough. Great flexibility.
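To make the CTE idea concrete, here is a minimal sketch of walking a thread with a recursive CTE. It makes several assumptions: a hypothetical messages(id, parent_id, subject) table (not the AOX schema), parent_id already filled in by some post-processing step, and SQLite via Python's sqlite3 module standing in for PostgreSQL so the example is self-contained. WITH RECURSIVE works in both, but the printf-based sort path is SQLite-specific; on PostgreSQL an array of ids would be the more natural choice.

```python
import sqlite3


def thread_rows(conn, root_id):
    """Return (depth, id, subject) rows for one thread, parents before replies.

    Assumes a hypothetical table messages(id, parent_id, subject).
    """
    return conn.execute(
        """
        WITH RECURSIVE thread(id, subject, depth, path) AS (
            -- anchor: the message that starts the thread
            SELECT id, subject, 0, printf('%08d', id)
            FROM messages
            WHERE id = ?
            UNION ALL
            -- recursive step: direct replies to anything already found
            SELECT m.id, m.subject, t.depth + 1,
                   t.path || '/' || printf('%08d', m.id)
            FROM messages m
            JOIN thread t ON m.parent_id = t.id
        )
        -- zero-padded path keeps each reply directly under its parent
        SELECT depth, id, subject FROM thread ORDER BY path
        """,
        (root_id,),
    ).fetchall()
```

The depth column is enough to indent a thread index, and the same query shape works for any subtree by passing a different root id.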
>>>> - We need to re-generate the original URLs for backwards compatibility
>>>
>>> I guess the message-id one ain't the tricky one... and it should be
>>> possible to fill a relation table like
>>> monharc_compat(message_id, list, year, month, message_number);
>>
>> Yeah. It's not so hard, you can just screen-scrape the current
>> archives the same way the search server does.
>
> Definitely an easy enough task.
>
> With all that said, I can't promise anything as it all depends on how much
> spare time I have, but I can proceed with the evaluation if you think it's
> useful. I have a feeling that AOX is not truly the right tool for the job,
> but we might be able to customise it to suit our needs. Are there any other
> requirements that weren't specified?
Well, I think we want to avoid customizing it. Using a custom
frontend, sure. But we don't want to end up customizing the
parser/backend. That's the road to unmaintainability.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/