Re: Idea for improving buildfarm robustness

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Idea for improving buildfarm robustness
Date: 2015-09-29 18:57:49
Message-ID: 20150929185749.GG3685@tamriel.snowman.net
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> But today I thought of another way: suppose that we teach the postmaster
> to commit hara-kiri if the $PGDATA directory goes away. Since the
> buildfarm script definitely does remove all the temporary data directories
> it creates, this ought to get the job done.

Yes, please.

> An easy way to do that would be to have it check every so often if
> pg_control can still be read. We should not have it fail on ENFILE or
> EMFILE, since that would create a new failure hazard under heavy load,
> but ENOENT or similar would be reasonable grounds for deciding that
> something is horribly broken. (At least on Windows, failing on EPERM
> doesn't seem wise either, since we've seen antivirus products randomly
> causing such errors.)

Sounds pretty reasonable to me.
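
For concreteness, the errno classification described above might look
roughly like the sketch below. This is only an illustration: the helper
name control_file_is_gone is hypothetical and is not taken from the
PostgreSQL tree, and treating ENOTDIR as the "or similar" case is an
assumption.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the PostgreSQL source tree).
 *
 * Returns true only when opening the file fails in a way that says the
 * file, or a directory on the way to it, no longer exists.  Every other
 * failure is treated as transient so that load-related or
 * antivirus-related errors never trigger a shutdown.
 */
static bool
control_file_is_gone(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd >= 0)
    {
        close(fd);
        return false;           /* still readable, nothing to do */
    }

    switch (errno)
    {
        case ENOENT:            /* file or a parent directory is missing */
        case ENOTDIR:           /* assumption: counts as "or similar" */
            return true;
        default:
            /* ENFILE, EMFILE, EPERM, EACCES, ...: do not give up */
            return false;
    }
}
```

A caller would pass the path to pg_control under $PGDATA and only start
a shutdown when the function returns true.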

> I wouldn't want to do this every time through the postmaster's main loop,
> but we could do this once an hour for no added cost by adding the check
> where it does TouchSocketLockFiles; or once every few minutes if we
> carried a separate variable like last_touch_time. Once an hour would be
> plenty to fix the buildfarm's problem, I should think.

I have a bad (?) habit of removing the data directory out from under a
running postmaster during development, so I would really like the check
to run a bit more often than once an hour, unless there's a particular
problem with that.
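
A separate timestamp variable, as suggested in the quoted text, would
make a shorter interval cheap. A minimal sketch, assuming a one-minute
interval; the names DATADIR_CHECK_INTERVAL_SECS, last_datadir_check and
datadir_check_due are made up for illustration:

```c
#include <stdbool.h>
#include <time.h>

/* Assumed interval; the thread only says "once every few minutes". */
#define DATADIR_CHECK_INTERVAL_SECS 60

/* Time of the most recent check; 0 means "never checked". */
static time_t last_datadir_check = 0;

/*
 * Hypothetical rate limiter: returns true when at least
 * DATADIR_CHECK_INTERVAL_SECS have passed since the last check, and
 * records the current time so the next call starts a new interval.
 */
static bool
datadir_check_due(time_t now)
{
    if (now - last_datadir_check < DATADIR_CHECK_INTERVAL_SECS)
        return false;

    last_datadir_check = now;
    return true;
}
```

The main loop would call datadir_check_due(time(NULL)) on each pass and
only touch the filesystem when it returns true.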

> Another question is what exactly "commit hara-kiri" should consist of.
> We could just abort() or _exit(1) and leave it to child processes to
> notice that the postmaster is gone, or we could make an effort to clean
> up. I'd be a bit inclined to treat it like a SIGQUIT situation, ie
> kill all the children and exit. The children are probably having
> problems of their own if the data directory's gone, so forcing
> termination might be best to keep them from getting stuck.

I like the idea of killing all the children and then exiting.

> Also, perhaps we'd only enable this behavior in --enable-cassert builds,
> to avoid any risk of a postmaster incorrectly choosing to suicide in a
> production scenario. Or maybe that's overly conservative.

That would work for my use-case. Perhaps enable it only in
--enable-cassert builds on the back branches, but enable it
unconditionally in master and see how things go for 9.6? I agree that
it feels overly conservative, but given our recent history, we should be
overly cautious with the back branches.

> Thoughts?

Thanks!

Stephen
