Quick Links

Re: WIP: [[Parallel] Shared] Hash

From:	Peter Geoghegan <pg(at)bowt(dot)ie>
To:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc:	Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject:	Re: WIP: [[Parallel] Shared] Hash
Date:	2017-03-22 23:55:28
Message-ID:	CAH2-WzkJC4sXAUJJsbXQtSZ7jqMNsnaouknzC-yZjQokEAaFzw@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Mar 22, 2017 at 3:17 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> If I follow the new code correctly, then it doesn't matter that you've
>> unlink()'d to take care of the more obvious resource management chore.
>> You can still have a reference leak like this, if I'm not mistaken,
>> because you still have backend local state (local VfdCache) that is
>> left totally decoupled with the new "shadow resource manager" for
>> shared BufFiles.
>
> You're right. The attached version fixes these problems. The
> BufFiles created or opened in this new way now participate in both of
> our leak-detection and clean-up schemes: the one in resowner.c
> (because I'm now explicitly registering with it as I had failed to do
> before) and the one in CleanupTempFiles (because FD_CLOSE_AT_EOXACT is
> set, which I already had in the previous version for the creator, but
> not the opener of such a file). I tested by commenting out my
> explicit BufFileClose calls to check that resowner.c starts
> complaining, and then by commenting out the resowner registration too
> to check that CleanupTempFiles starts complaining.

I took a quick look at your V9 today. This is by no means a
comprehensive review of 0008-hj-shared-buf-file-v9.patch, but it's
what I can manage right now.

The main change you made is well represented by the following part of
the patch, where you decouple close at eoXact with delete at eoXact,
with the intention of doing one but not the other for BufFiles that
are shared:

> /* these are the assigned bits in fdstate below: */
> -#define FD_TEMPORARY (1 << 0) /* T = delete when closed */
> -#define FD_XACT_TEMPORARY (1 << 1) /* T = delete at eoXact */
> +#define FD_DELETE_AT_CLOSE (1 << 0) /* T = delete when closed */
> +#define FD_CLOSE_AT_EOXACT (1 << 1) /* T = close at eoXact */
> +#define FD_TEMP_FILE_LIMIT (1 << 2) /* T = respect temp_file_limit */

So, shared BufFile fd.c segments within backend local resource manager
do not have FD_DELETE_AT_CLOSE set, because you mean to do that part
yourself by means of generic shared cleanup through dynamic shared
memory segment callback. So far so good.

However, I notice that the place that this happens instead,
PathNameDelete(), does not repeat the fd.c step of doing a final
stat(), and using the stats for a pgstat_report_tempfile(). So, a new
pgstat_report_tempfile() call is simply missing. However, the more
fundamental issue in my mind is: How can you fix that? Where would it
go if you had it?

If you do the obvious thing of just placing that before the new
unlink() within PathNameDelete(), on the theory that that needs parity
with the fd.c stuff, that has non-obvious implications. Does the
pgstat_report_tempfile() call need to happen when going through this
path, for example?:

> +/*
> + * Destroy a shared BufFile early. Files are normally cleaned up
> + * automatically when all participants detach, but it might be useful to
> + * reclaim disk space sooner than that. The caller asserts that no backends
> + * will attempt to read from this file again and that only one backend will
> + * destroy it.
> + */
> +void
> +SharedBufFileDestroy(SharedBufFileSet *set, int partition, int participant)
> +{

The theory with the unlink()'ing() function PathNameDelete(), I
gather, is that it doesn't matter if it happens to be called more than
once, say from a worker and then in an error handling path in the
leader or whatever. Do I have that right?

Obviously the concern I have about that is that any stat() call you
might add for the benefit of a new pgstat_report_tempfile() call,
needed to keep parity with fd.c, now has a risk of double counting in
error paths, if I'm not mistaken. We do need to do that accounting in
the event of error, just as we do when there is no error, at least if
current stats collector behavior is to be preserved. How can you
determine which duplicate call here is the duplicate? In other words,
how can you figure out which one is not supposed to
pgstat_report_tempfile()? If the size of temp files in each worker is
unknowable to the implementation in error paths, does it not follow
that it's unknowable to the user that queries pg_stat_database?

Now, I don't imagine that this should stump you. Maybe I'm wrong about
that possibility (that you cannot have exactly once
unlink()/stat()/whatever), or maybe I'm right and you can fix it while
preserving existing behavior, for example by relying on unlink()
reliably failing when called a second time, no matter how tight any
race was. What exact semantics does unlink() have with concurrency, as
far as the link itself goes?

If I'm not wrong about the general possibility, then maybe the
existing behavior doesn't need to be preserved in error paths, which
are after all exceptional -- it's not as if the statistics collector
is currently highly reliable. It's not obvious that you are
deliberately accepting of any of these risks or costs, though, which I
think needs to be clearer, at a minimum. What trade-off are you making
here?

Unfortunately, that's about the only useful piece of feedback that I
can think of right now -- be more explicit about what is permissible
and not permissible in this area, and do something with
pgstat_report_tempfile(). This is a bit like the
unlink()-ENOENT/-to-terminate (ENOENT ignore) issue. There are no
really hard questions here, but there certainly are some awkward
questions.

--
Peter Geoghegan

In response to

Re: WIP: [[Parallel] Shared] Hash at 2017-03-22 10:17:19 from Thomas Munro

Responses

Re: WIP: [[Parallel] Shared] Hash at 2017-03-23 07:35:09 from Thomas Munro

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Stephen Frost	2017-03-23 00:08:57	Re: increasing the default WAL segment size
Previous Message	Dilip Kumar	2017-03-22 23:53:19	Re: Enabling parallelism for queries coming from SQL or other PL functions