From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Oskari Saarenmaa <os(at)ohmu(dot)fi>, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: emergency outage requiring database restart
Date: 2016-11-01 15:05:11
Message-ID: CAHyXU0yjuFNeHv2NDhxTQmrpabNT4HG54mTZ=E9UkWbbumFzWQ@mail.gmail.com
Lists: pgsql-hackers
On Tue, Nov 1, 2016 at 8:56 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Merlin Moncure <mmoncure(at)gmail(dot)com> writes:
>> On Mon, Oct 31, 2016 at 10:32 AM, Oskari Saarenmaa <os(at)ohmu(dot)fi> wrote:
>>> Your production system's postgres backends probably have a lot more open
>>> files associated with them than the simple test case does. Since Postgres
>>> likes to keep files open as long as possible and only closes them when you
>>> need to free up fds to open new files, it's possible that your production
>>> backends have almost all allowed fds used when you execute your pl/sh
>>> function.
>>>
>>> If that's the case, the sqsh process that's executed may not have enough fds
>>> to do what it wanted to do and because of busted error handling could end up
>>> writing to fds that were opened by Postgres and point to $PGDATA files.
>
>> Does that apply? The mechanics are a sqsh function that basically does:
>> cat foo.sql | sqsh <args>
>> pipe redirection opens a new process, right?
>
> Yeah, but I doubt that either level of the shell would attempt to close
> inherited file handles.
>
> The real problem with Oskari's theory is that it requires not merely
> busted, but positively brain-dead error handling in the shell and/or
> sqsh, ie ignoring open() failures altogether. That seems kind of
> unlikely. Still, I suspect he might be onto something --- there must
> be some reason you can reproduce the issue in production and not in
> your test bed, and number-of-open-files is as good a theory as I've
> heard.
>
> Maybe the issue is not with open() failures, but with the resulting
> FD numbers being much larger than sqsh is expecting. It would be
> weird to try to store an FD in something narrower than int, but
> I could see a use of select() being unprepared for large FDs.
> Still, it's hard to translate that idea into scribbling on the
> wrong file...
Looking at the sqsh code, nothing really stands out. It's highly
developed and all the obvious error cases are checked. There could
certainly be a freak bug in there (or in FreeTDS, which sqsh links to)
doing the damage, though. In the meantime I'll keep trying to work up a
reliable reproduction. This particular routine only gets called in
batches on a quarterly basis, so things have settled down.
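For illustration, the kind of ignored open() failure Oskari's theory
requires would have to look something like this (purely hypothetical
fragment; nothing like it turned up in sqsh, and fd 3 just stands in for
a descriptor inherited from the backend):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical: fd 3 stands in for a descriptor inherited from the
     * postgres backend that points at a file under $PGDATA. */
    static int  out_fd = 3;

    void
    open_output(const char *path)
    {
        int         fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);

        if (fd >= 0)
            out_fd = fd;
        /* BUG: when open() fails (say, EMFILE under fd pressure) the
         * error is silently swallowed and out_fd keeps its old value */
    }

    void
    emit(const char *msg)
    {
        /* ... so this write can land on the inherited descriptor */
        (void) write(out_fd, msg, strlen(msg));
    }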
Just a thought: could COPY be tricked into writing to the wrong file
descriptor? For example, if a file was removed out from under us with
rm -rf and the fd-pressured backend immediately reopened, recycling the
descriptor number?
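The underlying mechanic there is just POSIX descriptor recycling: open()
returns the lowest free number, so a stale copy of a closed fd can
silently come to refer to a different file. A trivial demonstration of
the recycling (not the actual COPY code path):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        int         fd_a = open("/tmp/file_a", O_WRONLY | O_CREAT, 0644);
        int         stale = fd_a;   /* a copy of the number kept elsewhere */

        close(fd_a);                /* the number is now free... */

        int         fd_b = open("/tmp/file_b", O_WRONLY | O_CREAT, 0644);

        /* open() returns the lowest free descriptor, so fd_b == stale */
        printf("stale = %d, fd_b = %d\n", stale, fd_b);

        (void) write(stale, "oops\n", 5);   /* lands in file_b, not file_a */
        close(fd_b);
        return 0;
    }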
merlin