From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-hackers(at)postgreSQL(dot)org |
Subject: | Re: Idea for improving buildfarm robustness |
Date: | 2015-09-29 18:57:49 |
Message-ID: | 20150929185749.GG3685@tamriel.snowman.net |
Lists: | pgsql-hackers |
* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> But today I thought of another way: suppose that we teach the postmaster
> to commit hara-kiri if the $PGDATA directory goes away. Since the
> buildfarm script definitely does remove all the temporary data directories
> it creates, this ought to get the job done.

Yes, please.

> An easy way to do that would be to have it check every so often if
> pg_control can still be read. We should not have it fail on ENFILE or
> EMFILE, since that would create a new failure hazard under heavy load,
> but ENOENT or similar would be reasonable grounds for deciding that
> something is horribly broken. (At least on Windows, failing on EPERM
> doesn't seem wise either, since we've seen antivirus products randomly
> causing such errors.)

Sounds pretty reasonable to me.

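To make the errno distinction in the quoted paragraph concrete, here is a minimal sketch in C. It is not postmaster code: the helper names `errno_means_datadir_gone` and `control_file_still_present` are hypothetical, and treating ENOTDIR as one of the "or similar" cases is an assumption on top of what was proposed. Everything other than a "file is gone" error, including ENFILE, EMFILE and EPERM, is treated as transient.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper: does this errno from open() suggest that the data
 * directory has been removed?  ENOENT is the case named in the proposal;
 * ENOTDIR is an assumed "or similar" case.  ENFILE, EMFILE, EPERM and
 * everything else are treated as transient and never trigger a shutdown.
 */
static bool
errno_means_datadir_gone(int err)
{
    switch (err)
    {
        case ENOENT:
        case ENOTDIR:
            return true;
        default:
            return false;
    }
}

/*
 * Hypothetical check: returns true if the file at "path" can be opened,
 * or if the failure looks transient.  Returns false only when the error
 * indicates that the file (and probably its directory) no longer exists.
 */
static bool
control_file_still_present(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd >= 0)
    {
        close(fd);
        return true;
    }
    return !errno_means_datadir_gone(errno);
}
```

A caller would pass the path of pg_control under $PGDATA and begin shutting down only when the check returns false.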
> I wouldn't want to do this every time through the postmaster's main loop,
> but we could do this once an hour for no added cost by adding the check
> where it does TouchSocketLockFiles; or once every few minutes if we
> carried a separate variable like last_touch_time. Once an hour would be
> plenty to fix the buildfarm's problem, I should think.

I have a bad (?) habit of doing exactly this during development and
would really like it to be a bit more often than once/hour, unless
there's a particular problem with that.

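The "separate variable" option could look roughly like the sketch below. The names `last_datadir_check`, `datadir_check_due` and `DATADIR_CHECK_INTERVAL_SECS`, and the 60-second interval, are all assumptions for illustration; nothing in the thread fixes a value beyond "every few minutes".

```c
#include <stdbool.h>
#include <time.h>

/* Assumed interval between checks; the thread only says "every few minutes". */
#define DATADIR_CHECK_INTERVAL_SECS 60

/* Hypothetical counterpart to last_touch_time: when the check last ran. */
static time_t last_datadir_check = 0;

/*
 * Called on each pass through the main loop with the current time.
 * Returns true, and records the time, only when the interval has elapsed,
 * so the potentially expensive file check runs at most once per interval.
 */
static bool
datadir_check_due(time_t now)
{
    if (now - last_datadir_check < DATADIR_CHECK_INTERVAL_SECS)
        return false;
    last_datadir_check = now;
    return true;
}
```

With this shape, the main loop pays only for a time comparison on most iterations, and the interval can be tuned independently of the hourly socket-file touch.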
> Another question is what exactly "commit hara-kiri" should consist of.
> We could just abort() or _exit(1) and leave it to child processes to
> notice that the postmaster is gone, or we could make an effort to clean
> up. I'd be a bit inclined to treat it like a SIGQUIT situation, ie
> kill all the children and exit. The children are probably having
> problems of their own if the data directory's gone, so forcing
> termination might be best to keep them from getting stuck.

I like the idea of killing all the children and then exiting.

> Also, perhaps we'd only enable this behavior in --enable-cassert builds,
> to avoid any risk of a postmaster incorrectly choosing to suicide in a
> production scenario. Or maybe that's overly conservative.

That would work for my use-case. Perhaps only on --enable-cassert
builds for back-branches but enable it in master and see how things go
for 9.6? I agree that it feels overly conservative, but given our
recent history, we should be overly cautious with the back branches.

> Thoughts?

Thanks!

Stephen