From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2017-02-08 07:40:41
Message-ID: CAEepm=1xmY=OLfXJPr6FyS7GKb4t4eQt_nbMDg=WLgizqBy4Sw@mail.gmail.com
Lists: pgsql-hackers
On Tue, Feb 7, 2017 at 5:43 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> However, there are some specific implementation issues with this that
> I didn't quite anticipate. I would like to get feedback on these
> issues now, from both Thomas and Robert. The issues relate to how much
> the patch can or should "buy into resource management". You might
> guess that this new resource management code is something that should
> live in fd.c, alongside the guts of temp file resource management,
> within the function FileClose(). That way, it would be called by every
> possible path that might delete a temp file, including
> ResourceOwnerReleaseInternal(). That's not what I've done, though.
> Instead, refcount management is limited to a few higher level routines
> in buffile.c. Initially, resource management in FileClose() is made to
> assume that it must delete the file. Then, if and when directed to by
> BufFileClose()/refcount, a backend may determine that it is not its
> job to do the deletion -- it will not be the one that must "turn out
> the lights", and so indicates to FileClose() that it should not delete
> the file after all (it should just release vFDs, close(), and so on).
> Otherwise, when refcount reaches zero, temp files are deleted by
> FileClose() in more or less the conventional manner.
>
> The fact that there could, in general, be any error that causes us to
> attempt a double-deletion (deletion of a temp file from more than one
> backend) for a time is less of a problem than you might think. This is
> because there is a risk of this only for as long as two backends hold
> open the file at the same time. In the case of parallel CREATE INDEX,
> this is now the shortest possible period of time, since workers close
> their files using BufFileClose() immediately after the leader wakes
> them up from a quiescent state. And, if that were to actually happen,
> say due to some random OOM error during that small window, the
> consequence is no worse than an annoying log message: "could not
> unlink file..." (this would come from the second backend that
> attempted an unlink()). You would not see this when a worker raised an
> error due to a duplicate violation, or any other routine problem, so
> it should really be almost impossible.
>
> That having been said, this probably *is* a problematic restriction in
> cases where a temp file's ownership is not immediately handed over
> without concurrent sharing. What happens to be a small window for the
> parallel CREATE INDEX patch probably wouldn't be a small window for
> parallel hash join. :-(
>
> It's not hard to see why I would like to do things this way. Just look
> at ResourceOwnerReleaseInternal(). Any release of a file happens
> during RESOURCE_RELEASE_AFTER_LOCKS, whereas the release of dynamic
> shared memory segments happens earlier, during
> RESOURCE_RELEASE_BEFORE_LOCKS. ISTM that the only sensible way to
> implement a refcount is using dynamic shared memory, and that seems
> hard. There are additional reasons why I suggest we go this way, such
> as the fact that all the relevant state belongs to BufFile, which is
> implemented a layer above all of the guts of resource management of
> temp files within fd.c. I'd have to replicate almost all state in fd.c
> to make it all work, which seems like a big modularity violation.
>
> Does anyone have any suggestions on how to tackle this?
Hmm. One approach might be like this:
1. There is a shared refcount which is incremented when you open a
shared file and decremented if you optionally explicitly 'release' it.
(Not when you close it, because we can't allow code that may be run
during RESOURCE_RELEASE_AFTER_LOCKS to try to access the DSM segment
after it has been unmapped; more generally, creating destruction order
dependencies between different kinds of resource-manager-cleaned-up
objects seems like a bad idea. Of course the close code still looks
after closing the vfds in the local backend.)
2. If you want to hand the file over to some other process and exit,
you probably want to avoid race conditions or extra IPC burden. To
achieve that you could 'pin' the file, so that it survives even while
not open in any backend.
3. If the refcount reaches zero when you 'release' and the file isn't
'pinned', then you must delete the underlying files.
4. When the DSM segment is detached, we spin through all associated
shared files that we're still 'attached' to (i.e. opened but didn't
release) and decrement the refcount. If any shared file's refcount
reaches zero, its files should be deleted, even if it was 'pinned'.
In other words, the associated DSM segment's lifetime is the maximum
lifetime of shared files, but it can be shorter if you 'release' in
all backends and don't 'pin'. It's up to client code to come up with
some scheme to make that work, if it doesn't take the easy route of
pinning until DSM segment destruction.
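To make that a bit more concrete, here's a rough sketch of what such an
API could look like. SharedBufFileSlot and the shared_buffile_* names are
invented for illustration; nothing like them exists in fd.c or buffile.c
today:

#include "postgres.h"
#include "storage/buffile.h"
#include "storage/spin.h"

/* Hypothetical per-file state living inside the DSM segment. */
typedef struct SharedBufFileSlot
{
    slock_t     mutex;      /* protects refcount and pinned */
    int         refcount;   /* backends that opened but haven't released */
    bool        pinned;     /* keep files on disk even at refcount == 0 */
} SharedBufFileSlot;

/* Open (or create) the underlying temp file and bump the shared refcount. */
extern BufFile *shared_buffile_open(SharedBufFileSlot *slot, const char *name);

/* Let the file survive with no backend attached, until unpinned or detach. */
extern void shared_buffile_pin(SharedBufFileSlot *slot);
extern void shared_buffile_unpin(SharedBufFileSlot *slot);

/*
 * Close our vfds and drop our reference.  If the refcount reaches zero and
 * the slot isn't pinned, the releasing backend deletes the underlying files.
 */
extern void shared_buffile_release(SharedBufFileSlot *slot, BufFile *file);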
I think in your case you'd simply pin all the BufFiles, allowing
workers to exit when they're done; the leader would wait for all
workers to indicate they'd finished, and then open the files. The
files would be deleted eventually when the last process detaches from
the DSM segment (very likely the leader).
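With the sketch above, the worker side of that hand-over might reduce to
something like the following (hypothetical names again, and the
run-dumping step is elided):

/* Hypothetical worker-side flow for parallel CREATE INDEX. */
static void
worker_hand_over_run(SharedBufFileSlot *slot, const char *filename)
{
    BufFile    *file = shared_buffile_open(slot, filename);

    /* ... dump this worker's sorted run into 'file' ... */

    shared_buffile_pin(slot);           /* file must outlive this worker */
    shared_buffile_release(slot, file); /* close our vfds; the pin holds it */
}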
In my case I'd pin all shared BufFiles and then release them when I'd
finished reading them back in and didn't need them anymore, and unpin
them in the first participant to discover that the end had been
reached (it would be a programming error to pin twice or unpin twice,
like similarly named operations for DSM segments and DSA areas).
That'd preserve the existing Hash Join behaviour of deleting batch
files as soon as possible, but also guarantee cleanup in any error
case.
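Roughly, the per-batch read-back in that scheme might look like this
(again only a sketch with invented names):

/* Hypothetical reader-side flow for a shared hash join batch file. */
static void
drain_shared_batch(SharedBufFileSlot *slot, const char *filename,
                   bool reached_end)
{
    BufFile    *file = shared_buffile_open(slot, filename);

    /* ... read the batch's tuples back in ... */

    shared_buffile_release(slot, file);     /* done with our reference */

    if (reached_end)
        shared_buffile_unpin(slot);         /* allow immediate deletion */
}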
There is something a bit unpleasant about teaching other subsystems
about the existence of DSM segments just to be able to use DSM
lifetime as a cleanup scope. I do think on_dsm_detach is a pretty
good place to do cleanup of resources in parallel computing cases like
ours, but I wonder if we could introduce a more generic destructor
callback interface which DSM segments could provide.
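For instance, on_dsm_detach() already gives us a registration point; a
cleanup hook for shared files might be wired up along these lines
(SharedBufFileSet and shared_buffile_cleanup() are invented here, only
the on_dsm_detach() machinery is real):

#include "postgres.h"
#include "storage/dsm.h"

typedef struct SharedBufFileSet SharedBufFileSet;          /* hypothetical */

extern void shared_buffile_cleanup(SharedBufFileSet *set); /* hypothetical */

/* Callback run by dsm.c when the segment is detached. */
static void
shared_buffile_on_detach(dsm_segment *seg, Datum arg)
{
    SharedBufFileSet *set = (SharedBufFileSet *) DatumGetPointer(arg);

    /* Release anything still attached; delete files whose refcount hits zero. */
    shared_buffile_cleanup(set);
}

/* Registered once, right after attaching the segment. */
static void
attach_shared_buffile_set(dsm_segment *seg, SharedBufFileSet *set)
{
    on_dsm_detach(seg, shared_buffile_on_detach, PointerGetDatum(set));
}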
--
Thomas Munro
http://www.enterprisedb.com