From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jerry Sievers <gsievers19(at)comcast(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SegFault on 9.6.14
Date: 2019-07-30 18:35:05
Message-ID: CA+Tgmoay7tcgW=ZVX3N7nM8oqs3Hqx0eV8k9cCSriJBAtat8SA@mail.gmail.com
Lists: pgsql-hackers
On Thu, Jul 18, 2019 at 9:45 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I think this is going in the wrong direction. Nodes should *always*
> assume that a rescan is possible until ExecEndNode is called.
> If you want to do otherwise, you are going to be inventing a whole
> bunch of complicated and doubtless-initially-buggy control logic
> to pass down information about whether a rescan might be possible.
> That doesn't sound like a recipe for a back-patchable fix. Perhaps
> we could consider redesigning the rules around REWIND in a future
> version, but that's not where to focus the bug fix effort.
So, if I can summarize how we got here, as best I understand it:
0. The historic behavior of the executor is to assume it's OK to leak
resources for the lifetime of the query. Nodes that are executed to
completion generally do some cleanup, but we feel free (as under
Limit) to just stop executing a node without giving it any hint that
it should release resources. So a Sort may hold onto a terabyte of
memory and an index scan may keep holding a pin even after there's no
theoretical way of ever needing those resources again, and we just
don't care.
1. Parallel query made that perhaps-already-shaky assumption a lot
more problematic. Partly that's because workers are a more scarce
and considerably heavier resource than anything else, and moreover act
as a container for anything else, so whatever you were leaking before,
you can now leak N times more of it, plus N processes, until the end
of the query. However, there's a correctness reason too, which is that
when a node has a copy in the leader and a copy in every worker, each
copy has its own instrumentation data (startup time, run time, nloops,
etc) and we can only fold all that together once the node is done
executing, because it's really hard to add up a bunch of numbers
before the numbers are done changing. We could've made the
instrumentation shared throughout, but if we had, we would have had
contention when updating the instrumentation data, which seems like
it'd be bad.
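
To make the instrumentation issue concrete, here's a minimal standalone
toy (deliberately not the real Instrumentation struct or InstrAggNode,
just the shape of the problem): each copy of the node keeps private
counters, and the leader can only fold them into its own copy once
every copy has stopped changing.

/* Toy model of per-copy node instrumentation; not PostgreSQL source. */
#include <stdio.h>

typedef struct ToyInstr
{
    double  ntuples;        /* tuples produced by this copy of the node */
    double  nloops;         /* times this copy was executed */
    double  total_time;     /* seconds spent in this copy */
} ToyInstr;

/* Fold one finished copy's counters into the destination. */
static void
toy_instr_agg(ToyInstr *dst, const ToyInstr *add)
{
    dst->ntuples += add->ntuples;
    dst->nloops += add->nloops;
    dst->total_time += add->total_time;
}

int
main(void)
{
    /* One copy in the leader, two in workers, each with private counters. */
    ToyInstr    leader = {100, 1, 0.20};
    ToyInstr    workers[2] = {{250, 1, 0.45}, {230, 1, 0.41}};

    /*
     * Only safe once the worker copies are done executing: if a worker
     * were still adding to its counters, the sum below would be stale,
     * and if we ever ran this fold a second time we'd double-count.
     */
    for (int i = 0; i < 2; i++)
        toy_instr_agg(&leader, &workers[i]);

    printf("ntuples=%.0f nloops=%.0f time=%.2fs\n",
           leader.ntuples, leader.nloops, leader.total_time);
    return 0;
}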
2. To fix that correctness problem, we decided to try to shut down the
node under a Limit node when we're done with it (commit
85c9d3475e4f680dbca7c04fe096af018f3b8760). At a certain level, this
looks fundamentally necessary to me. If you're going to have N
separate copies of the instrumentation, and you want to add them up
when you're done, then you have to decide to be done at some point;
otherwise you don't know when to add them up, and maybe won't add them
up at all, and then you'll be sad. This does not mean that the exact
timing couldn't be changed somehow, but if you want a correct
implementation, you have to shut down Limit's sub-node after you're
done executing it (so that you can get the instrumentation data from
the workers after it's final) and before you start destroying DSM
segments and stuff (so that you get the instrumentation data from the
workers before it vanishes).
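
For reference, the shape of that change is roughly the following
(paraphrasing nodeLimit.c from memory, so treat it as a sketch rather
than the exact committed code): once the count is satisfied, Limit
proactively shuts down its subplan, which is what lets Gather collect
the workers' instrumentation while the DSM segment still exists.

    /* Sketch only; see ExecLimit in nodeLimit.c for the real logic. */
    if (!node->noCount &&
        node->position - node->offset >= node->count)
    {
        node->lstate = LIMIT_WINDOWEND;

        /*
         * We won't pull any more tuples from the subplan, so ask it to
         * release its resources now.  In a parallel plan this reaches
         * Gather / Gather Merge, which folds the workers' instrumentation
         * into the leader's copy before the workers and their DSM go away.
         */
        (void) ExecShutdownNode(outerPlanState(node));

        return NULL;
    }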
3. The aforementioned commit turned out to be buggy in at least two
ways, precisely because it didn't do a good enough job predicting when
the Limit needed to be shut down. First, there was commit
2cd0acfdade82f3cab362fd9129d453f81cc2745, where we missed the fact
that you could hit the Limit and then back up. Second, there's the
present issue, where the Limit gets rescanned.
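
The first of those was addressed by teaching Limit not to shut the
subplan down when a backward fetch is still possible, something along
these lines (again paraphrased, not the exact code from that commit):

    /* Only shut down the subplan if we can never be asked to back up. */
    if (!(node->ps.state->es_top_eflags & EXEC_FLAG_BACKWARD))
        (void) ExecShutdownNode(outerPlanState(node));

The rescan case is harder, because nothing available at the Limit node
tells it whether some ancestor might rescan it later, which is exactly
the gap the two approaches below try to close in different ways.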
So, given all that, if we want to adopt Tom's position that we should
always cater to a possible rescan, then we're going to have to rethink
the way that instrumentation data gets consolidated from workers into
the leader in such a way that we can consolidate multiple times
without ending up with the wrong answer. The other option is to do
what I understand Amit and Thomas to be proposing, which is to do a
better job identifying the case where we're "done for good" and can
trigger the shutdown fearlessly.
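
To illustrate what the first option might look like, here's a
standalone toy sketch (an assumption about one possible design, not
anything that exists in the tree today): keep the leader's counters and
the per-worker snapshots separate, and rebuild the folded totals from
scratch each time, so consolidating again after a rescan can't
double-count.

/* Toy sketch of idempotent consolidation; not PostgreSQL source. */
#include <stdio.h>

#define NWORKERS 2

typedef struct ToyInstr
{
    double  ntuples;
    double  total_time;
} ToyInstr;

typedef struct ToyNodeInstr
{
    ToyInstr    leader;              /* leader's private counters */
    ToyInstr    workers[NWORKERS];   /* latest snapshot from each worker */
    ToyInstr    folded;              /* derived; safe to rebuild any time */
} ToyNodeInstr;

/* Rebuild the folded totals from the stored pieces; idempotent. */
static void
toy_consolidate(ToyNodeInstr *n)
{
    n->folded = n->leader;
    for (int i = 0; i < NWORKERS; i++)
    {
        n->folded.ntuples += n->workers[i].ntuples;
        n->folded.total_time += n->workers[i].total_time;
    }
}

int
main(void)
{
    ToyNodeInstr n = {
        .leader = {100, 0.20},
        .workers = {{250, 0.45}, {230, 0.41}},
    };

    toy_consolidate(&n);        /* e.g. at first shutdown */
    toy_consolidate(&n);        /* e.g. again after a rescan: same answer */
    printf("ntuples=%.0f time=%.2fs\n", n.folded.ntuples, n.folded.total_time);
    return 0;
}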
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company