Re: PGLister fails to de-dup messages addressed twice to same list

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-www(at)lists(dot)postgresql(dot)org
Subject: Re: PGLister fails to de-dup messages addressed twice to same list
Date: 2017-11-22 20:33:18
Message-ID: CABUevEy5BXAA6_SUxzNcnLv+zgvbP8hMv8C5MTL+ugs5yBFRGA@mail.gmail.com
Lists: pgsql-www

On Tue, Nov 21, 2017 at 4:43 PM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:

>
> * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> > Stephen Frost <sfrost(at)snowman(dot)net> writes:
> > > * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> > >> ... I have no doubt at all that that's
> > >> going to happen a *lot* during the list domain changeover, so I'd
> > >> strongly recommend putting something in place to de-dup.
> >
> > > Yeah, I'm already chatting w/ Magnus about this.
> >
> > Curiously, my replies to the same message seem to have been delivered
> > only once, and that's not because I was awake enough to notice and
> > remove the extra cc ;-). So my guess at this point is that you do
> > have some de-dup in there, but it ain't working for gmail-originated
> > messages.
>
> As near as I can tell, GMail delivered the message to us in two
> independent runs with two connections to our mail server, while your
> server only delivered one message in one run to our server.
>

Yup, that's indeed what happened.

> I'm guessing that your server realized it was the same MX for both
> postgresql.org and lists.postgresql.org and expected our server to
> handle delivering to the multiple addresses, but PGLister, for a given
> email that comes in, is only going to deliver once to each of the lists
> that are listed in the inbound email. On the other hand, GMail seems to
> split the email on the source side for each domain/subdomain and
> delivers them independently.
>
> Unfortunately, we aren't going to be able to depend on the sender's MTA
> to always put the message into one email to us, as made clear by GMail
> but also because it's not really "correct." We need to have a
> message-id cache in the PG database that will throw away dups when they
> come in on a per-list basis. I don't anticipate it being too difficult
> to implement, really, but I think we'll need it to last at least a
> couple of days which implies having a cleanup job for it, et al.
>

I have deployed what I think is the correct way to deal with this
deduplication. Basically, it checks whether a given combination of (msgid,
list) has been seen before, and if it has, the new copy is dropped on the
floor (with a log entry, of course). We were already keeping track of that
information (though in two different tables), so the extra check was easy
and will be cheap.
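
As a rough illustration only (this is not the actual PGLister code; the
table name, column names and function names below are made up, and SQLite
stands in for the real PG database), a per-list check on (msgid, list)
with the kind of cleanup job Stephen mentioned could look something like:

```python
import sqlite3
import time


def open_cache():
    """Create an in-memory cache keyed on (msgid, list_name).

    Hypothetical schema: the real PGLister tables are not shown in this thread.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seen_messages ("
        " msgid TEXT NOT NULL,"
        " list_name TEXT NOT NULL,"
        " seen_at REAL NOT NULL,"
        " PRIMARY KEY (msgid, list_name))"
    )
    return conn


def should_deliver(conn, msgid, list_name, now=None):
    """Return True the first time (msgid, list_name) is seen, False for a dup."""
    now = time.time() if now is None else now
    # The primary key rejects a second row for the same pair, so the
    # check and the bookkeeping happen in a single statement.
    cur = conn.execute(
        "INSERT OR IGNORE INTO seen_messages (msgid, list_name, seen_at)"
        " VALUES (?, ?, ?)",
        (msgid, list_name, now),
    )
    conn.commit()
    return cur.rowcount == 1


def purge_older_than(conn, max_age_seconds, now=None):
    """Cleanup job: forget entries older than max_age_seconds; return the count."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "DELETE FROM seen_messages WHERE seen_at < ?",
        (now - max_age_seconds,),
    )
    conn.commit()
    return cur.rowcount
```

The same message-id going to two different lists is still delivered to
both; only a second copy for the same list is dropped.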

A db check shows that 33 emails have so far been delivered in duplicate
across lists. Mostly to general (22 of those mails), but a few to other
lists too.

So far no duplicate delivery has been attempted since I deployed the check,
but they only show up once every few hours, so we'll wait a while to see if
it works.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
