Re: Idea for improving buildfarm robustness

From:	Stephen Frost <sfrost(at)snowman(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: Idea for improving buildfarm robustness
Date:	2015-09-29 18:57:49
Message-ID:	20150929185749.GG3685@tamriel.snowman.net
Lists:	pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> But today I thought of another way: suppose that we teach the postmaster
> to commit hara-kiri if the $PGDATA directory goes away. Since the
> buildfarm script definitely does remove all the temporary data directories
> it creates, this ought to get the job done.

Yes, please.

> An easy way to do that would be to have it check every so often if
> pg_control can still be read. We should not have it fail on ENFILE or
> EMFILE, since that would create a new failure hazard under heavy load,
> but ENOENT or similar would be reasonable grounds for deciding that
> something is horribly broken. (At least on Windows, failing on EPERM
> doesn't seem wise either, since we've seen antivirus products randomly
> causing such errors.)

Sounds pretty reasonable to me.
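
For illustration only, the errno classification described above might be sketched as follows. The function names are hypothetical and not taken from the PostgreSQL source. Treating EACCES like EPERM, counting ENOTDIR as "similar" to ENOENT, and ignoring every other error are assumptions made for this sketch.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper: decide whether an errno from opening pg_control
 * suggests that the data directory has been removed.
 */
bool
errno_means_datadir_gone(int err)
{
    switch (err)
    {
        case ENFILE:
        case EMFILE:
            /* Out of file descriptors: a load problem, not a missing directory. */
            return false;
        case EPERM:
        case EACCES:
            /* Permission errors can be transient (e.g. antivirus); EACCES is an assumption. */
            return false;
        case ENOENT:
        case ENOTDIR:
            /* The file or one of its parent directories is gone; ENOTDIR is an assumption. */
            return true;
        default:
            /* Assumption: stay conservative about anything else. */
            return false;
    }
}

/* Hypothetical probe: try to open the control file read-only and classify the result. */
bool
control_file_is_gone(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd >= 0)
    {
        close(fd);
        return false;
    }
    return errno_means_datadir_gone(errno);
}
```

The probe only opens and closes the file, so it stays cheap enough to run periodically.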

> I wouldn't want to do this every time through the postmaster's main loop,
> but we could do this once an hour for no added cost by adding the check
> where it does TouchSocketLockFiles; or once every few minutes if we
> carried a separate variable like last_touch_time. Once an hour would be
> plenty to fix the buildfarm's problem, I should think.

I have a bad (?) habit of doing exactly this (removing the data directory
out from under a running postmaster) during development, and would really
like the check to run a bit more often than once an hour, unless there's a
particular problem with that.

> Another question is what exactly "commit hara-kiri" should consist of.
> We could just abort() or _exit(1) and leave it to child processes to
> notice that the postmaster is gone, or we could make an effort to clean
> up. I'd be a bit inclined to treat it like a SIGQUIT situation, ie
> kill all the children and exit. The children are probably having
> problems of their own if the data directory's gone, so forcing
> termination might be best to keep them from getting stuck.

I like the idea of killing all the children and then exiting.
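
As a minimal sketch of the "kill all the children and exit" option, assuming the child PIDs are available in a plain array (the real postmaster keeps its own bookkeeping, which is not shown here). Both function names are hypothetical.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: send signo to each PID in the array and return
 * how many kill() calls succeeded.
 */
size_t
signal_children(const pid_t *pids, size_t n, int signo)
{
    size_t delivered = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (kill(pids[i], signo) == 0)
            delivered++;
    }
    return delivered;
}

/*
 * Hypothetical shutdown path: treat loss of the data directory like a
 * SIGQUIT situation, i.e. signal every child and exit immediately
 * without attempting any cleanup that would touch the missing files.
 */
void
shut_down_after_datadir_loss(const pid_t *pids, size_t n)
{
    (void) signal_children(pids, n, SIGQUIT);
    _exit(1);
}
```

Using _exit() rather than exit() skips atexit handlers, which matches the intent of not running cleanup code against a directory that no longer exists.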

> Also, perhaps we'd only enable this behavior in --enable-cassert builds,
> to avoid any risk of a postmaster incorrectly choosing to suicide in a
> production scenario. Or maybe that's overly conservative.

That would work for my use-case. Perhaps only on --enable-cassert
builds for back-branches but enable it in master and see how things go
for 9.6? I agree that it feels overly conservative, but given our
recent history, we should be overly cautious with the back branches.

> Thoughts?

Thanks!

Stephen

In response to

Idea for improving buildfarm robustness at 2015-09-29 18:48:58 from Tom Lane

Responses

Re: Idea for improving buildfarm robustness at 2015-09-29 19:07:04 from Tom Lane
