Quick Links

Re: Get memory contexts of an arbitrary backend process

From:	torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To:	Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc:	kasahara(dot)tatsuhito(at)gmail(dot)com, michael(at)paquier(dot)xyz, tomas(dot)vondra(at)2ndquadrant(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Get memory contexts of an arbitrary backend process
Date:	2020-10-28 06:32:15
Message-ID:	270ec80dbbebfb5c6f2b9fdd7a7858b5@oss.nttdata.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 2020-10-23 13:46, Kyotaro Horiguchi wrote:
> Wait...
>
>> Attachments:
>> 0003-Enabled-pg_get_backend_memory_contexts-to-collect.patch
>
> For a moment I thought that the number is patch number but the
> predecessors are 0002-Enabled..collect.patch and 0001-(same
> name). It's not mandatory but we usually do as the follows and it's
> the way of git.
>
> v1-0001-Enabled...collect.patch
> v2-0001-Enabled...collect.patch
>
> The vn is added by -v option for git-format-patch.

Sorry for the confusion. I'll follow that way next time.

> At Thu, 22 Oct 2020 21:32:00 +0900, torikoshia
> <torikoshia(at)oss(dot)nttdata(dot)com> wrote in
>> > > I added a shared hash table consisted of minimal members
>> > > mainly for managing whether the file is dumped or not.
>> > > Some members like 'loc' seem useful in the future, but I
>> > > haven't added them since it's not essential at this point.
>> > Yes, that would be good.
>> > + /*
>> > + * Since we allow only one session can request to dump
>> > memory context at
>> > + * the same time, check whether the dump files already exist.
>> > + */
>> > + while (stat(dumpfile, &stat_tmp) == 0 || stat(tmpfile,
>> > &stat_tmp) == 0)
>> > + {
>> > + pg_usleep(1000000L);
>> > + }
>> > If pg_get_backend_memory_contexts() is executed by two or more
>> > sessions at the same time, it cannot be run exclusively in this way.
>> > Currently it seems to cause a crash when do it so.
>> > This is easy to reproduce and can be done as follows.
>> > [session-1]
>> > BEGIN;
>> > LOCK TABKE t1;
>> > [Session-2]
>> > BEGIN;
>> > LOCK TABLE t1; <- waiting
>> > [Session-3]
>> > select * FROM pg_get_backend_memory_contexts(<pid of session-2>);
>> > [Session-4]
>> > select * FROM pg_get_backend_memory_contexts(<pid of
>> > session-2>);
>> > If you issue commit or abort at session-1, you will get SEGV.
>> > Instead of checking for the existence of the file, it might be better
>> > to use a hash (mcxtdumpHash) entry with LWLock.
>>
>> Thanks!
>> Added a LWLock and changed the way from checking the file existence
>> to finding the hash entry.
>
>> > + if (proc == NULL)
>> > + {
>> > + ereport(WARNING,
>> > + (errmsg("PID %d is not a PostgreSQL server
>> > process", dst_pid)));
>> > + return (Datum) 1;
>> > + }
>> > Shouldn't it clear the hash entry before return?
>>
>> Yeah. added codes for removing the entry.
>
> + entry = AddEntryToMcxtdumpHash(dst_pid);
> +
> + /* Check whether the target process is PostgreSQL backend process.
> */
> + /* TODO: Check also whether backend or not. */
> + proc = BackendPidGetProc(dst_pid);
> +
> + if (proc == NULL)
> + {
> + ereport(WARNING,
> + (errmsg("PID %d is not a PostgreSQL server process", dst_pid)));
> +
> + LWLockAcquire(McxtDumpHashLock, LW_EXCLUSIVE);
> +
> + if (hash_search(mcxtdumpHash, &dst_pid, HASH_REMOVE, NULL) == NULL)
> + elog(WARNING, "hash table corrupted");
> +
> + LWLockRelease(McxtDumpHashLock);
> +
> + return (Datum) 1;
> + }
>
> Why do you enter a useles entry then remove it immedately?

Do you mean I should check the process existence first
since it enables us to skip entering hash entries?

>
> + PG_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum)
> Int32GetDatum(dst_pid));
> + {
> + SendProcSignal(dst_pid, PROCSIG_DUMP_MEMORY, InvalidBackendId);
>
> "PROCSIG_DUMP_MEMORY" is somewhat misleading. Hwo about
> "PROCSIG_DUMP_MEMCXT" or "PROCSIG_DUMP_MEMORY_CONTEXT"?

I'll go with "PROCSIG_DUMP_MEMCXT".

>
> I thought that the hash table would prevent multiple reqestors from
> making a request at once, but the patch doesn't seem to do that.
>
> + /* Wait until target process finished dumping file. */
> + while (entry->dump_status == MCXTDUMPSTATUS_NOTYET)
>
> This needs LWLock. And this could read the entry after reused by
> another backend if the dumper process is gone. That isn't likely to
> happen, but theoretically the other backend may set it to
> MCXTDUMPSTATUS_NOTYET inbetween two successive check on the member.

Thanks for your notification.
I'll use an LWLock.

>
> + /*
> + * Make dump file ends with 'D'.
> + * This is checked by the caller when reading the file.
> + */
> + fputc('E', fpout);
>
> Which is right?

Sorry, the comment was wrong..

>
> + fputc('E', fpout);
> +
> + CHECK_FOR_INTERRUPTS();
>
> This means that the process accepts another request and rewrite the
> file even while the first requester is reading it. And, the file can
> be removed by the first requestor before the second requestor can read
> it.

I added CHECK_FOR_INTERRUPTS() here to make the dump cancellation
possible, however, considering your indication, it needs to think
about a way to handle only the dump cancellation.

>
> + mcxtdumpHash = ShmemInitHash("mcxtdump hash",
> + SHMEM_MEMCONTEXT_SIZE,
> + SHMEM_MEMCONTEXT_SIZE,
> .
> The space needed for this hash doesn't seem to be secured. The hash is
> sized to 64 entries, so pg_get_backend_memory_contexts() may fail with
> ERROR "out of shared memory".
>
> The hash is used only to check if the dump file is completed and if
> ended with error. If we need only those, an shared byte array with the
> length of max_backend is sufficient.

Although there was a comment that controlling dump location may be a
good idea, but it also seems possible to include the location
information
in the struct Hoge you suggested below.

On Tue, Sep 1, 2020 at 12:03 AM Kasahara Tatsuhito <[hidden email]>
wrote:
| - send a signal to dump to a PID, it first record following
information into the shared hash.
| pid (specified pid)
| loc (dump location, currently might be ASAP)

>
> + PG_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum)
> Int32GetDatum(dst_pid));
> + {
> + SendProcSignal(dst_pid, PROCSIG_DUMP_MEMORY, InvalidBackendId);
> +
> + /* Wait until target process finished dumping file. */
> + while (entry->dump_status == MCXTDUMPSTATUS_NOTYET)
> + {
> + CHECK_FOR_INTERRUPTS();
> + pg_usleep(10000L);
> + }
> + }
> + PG_END_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum)
> Int32GetDatum(dst_pid));
> +
> + if (entry->dump_status == MCXTDUMPSTATUS_ERROR)
> + {
> ..
> + if (stat(tmpfile, &stat_tmp) == 0)
> + unlink(tmpfile);
> + if (stat(dumpfile, &stat_tmp) == 0)
> + unlink(dumpfile);
> ...
> + return (Datum) 0;
> + }
> +
> + /* Read values from the dumped file and put them on tuplestore. */
> + PutDumpedValuesOnTuplestore(dumpfile, tupstore, tupdesc, dst_pid);
>
> This means that if the function gets sigint before the dumper creates
> the file, the dumper can leave a dump file?

In that case, the requestor removes the corresponding hash entry in the
callback McxtReqKill, then the dumper who can not find the hash entry
does not dump a file.

However, I've now noticed that when the requestor gets sigint just
after the dumper check, the dump file remains.

In the current design, only the requestor can remove the dump file,
but it seems necessary to allow the dumper to remove it.

>> > + /* Wait until target process finished dumping file. */
>> > + while (!entry->is_dumped)
>> > + {
>> > + CHECK_FOR_INTERRUPTS();
>> > + pg_usleep(10000L);
>> > + }
>> > If the target is killed and exit before dumping the memory
>> > information, you're in an infinite loop here.
>> > So how about making sure that any process that has to stop before
>> > doing a memory dump changes the status of the hash (entry->is_dumped)
>> > before stopping and the caller detects it?
>> > I'm not sure it's best or not, but you might want to use something
>> > like the on_shmem_exit callback.
>>
>> Thanks for your idea!
>> Added a callback to change the status of the hash table entry.
>>
>> Although I think it's necessary to remove this callback when it
>> finished
>> processing memory dumping, on_shmem_exit() does not seem to have such
>> a function.
>> I used before_shmem_exit() since it has a corresponding function to
>> remove registered callback.
>> If it's inappropriate, I'm going to add a function removing the
>> registered callback of on_shmem_exit().
>
> This seems to leave a file for the pid.

As mentioned above, there can be a chance to remain files.

>
>> > In the current design, if the caller stops processing before reading
>> > the dumped file, you will have an orphaned file.
>> > It looks like the following.
>> > [session-1]
>> > BEGIN;
>> > LOCK TABKE t1;
>> > [Session-2]
>> > BEGIN;
>> > LOCK TABLE t1; <- waiting
>> > [Session-3]
>> > select * FROM pg_get_backend_memory_contexts(<pid of session-2>);
>> > If you cancel or terminate the session-3, then issue commit or abort
>> > at session-1, you will get orphaned files in pg_memusage.
>> > So if you allow only one session can request to dump file, it could
>> > call pg_memusage_reset() before send the signal in this function.
>>
>> Although I'm going to allow only one session per one target process,
>> I'd like to allow running multiple pg_get_backend_memory_contexts()
>> which target process is different.
>>
>> Instead of calling pg_memusage_reset(), I added a callback for
>> cleaning up orphaned files and the elements of the hash table
>> using before_shmem_exit() through PG_ENSURE_ERROR_CLEANUP() and
>> PG_END_ENSURE_ERROR_CLEANUP().
>>
>> I chose PG_ENSURE_ERROR_CLEANUP() and PG_END_ENSURE_ERROR_CLEANUP()
>> here since it can handle not only termination but also cancellation.
>>
>> Any thoughts?
>
> +/*
> + * pg_memusage_reset
> + * Remove the memory context dump files.
> + */
> +void
> +pg_memusage_reset(int pid)
>
> The function name seem to represents somewhat different from what it
> does.

Yeah, It looks like it actually reset the memory.
I'll rename it to remove_memcxt_file or something.

>
> I think we might need to step-back to basic design of this feature
> since this patch seems to have unhandled corner cases that are
> difficult to find.

Agreed and thanks for writing it down below.

> - Dump file lifecycle or state-transition of the dumper
>
> Currently the lifecycle of a dump file, or the state-transition of
> the dumper process doesn't seem to be well defined.
>
> The file is create only by the dumper.
> If the requestor reads the completed file, the reader removes it.
>
> If the dumper receives a cancel request, the dumper removes it if
> any.

>
> Of course dumper removes it if it fails to complete the file.
>
> - Better way of signalling between the requestor and the dumper
>
> I think there's no doubt about request signal.
>
> About the complete signal, currently the requestor polls on a flag
> in a hash entry. I'm wondering if we can get rid of polling. The
> problem on doing that is the lack of a means for a dumper to know
> the requestor. We need to store requestor pid or backend id in the
> shared hash entry.

Agreed to get rid of polling.

BTW, it seems common to use a latch instead of pg_usleep() to wait until
signals arrive as far as I read latch.h.
I'm now thinking about using a latch here and it would make polling
removed.

> By the way, about shared hash entry, it uses about 70kB for only 64
> entries so it seems inefficient than a shared array that has
> MaxBackends entries. If we used a following array on shared memory,
>
> struct hoge
> {
> BackendId requestor[MAX_BACKENDS];
> int status[MAX_BACKENDS];
> LWLock lock;
> };

If the requestor's id is 5 and dumper's id is 10,
is this struct used like below?

- hoge.requestor[10] = 5
- Both status[5] and status[10] change like "request", "idle" or "done"

> This array has the size of 24 * MaxBackends + 16. 24kB for 1000
> backends. It could be on dsm since this feature is not used
> commonly.

Sorry but I'm not sure this calculation.
Do you mean 16.24kB for 1000 backends?

Regarding dsm, do you imagine using hoge on dsm?

>
> - The way to cancel a request already made. (or management of the dump
> state transition.)
>
> Cancellation seems to contain some race conditions. But basically
> that could be done by sending a request signal after setting the
> hoge.requestor above to some special value, not needing the third

I imagined hoge.requestor was set to requestor's backendid.
Isn't it hoge.status?

> type of signal. The special value should be different from the
> initial state, which signals that the process is accepting a new
> request.
>
> As the whole, that would looks like the folloing?
>
> ------------------------------------------------------------
> Successful request.
>
> Requestor dumper state
> [idle] initial
> [request] -------------------> requestor pid/backendid
> -signal->
> [dumping]
> <-signal-[done]
> [read]
> [done] --------------------> initial
>
> ------------------------------------------------------------
> On failure, the dumper signals with setting state to initial.
> [request] -------------------> requestor pid/backendid
> -signal->
> [dumping]
> [failed] initial
> <-signal-
> (some other requestor might come meantime.)
> <sees that the requestor is not me>

Is this "some other requestor come case" relevant to the dumper failure?

Regards,

--
Atsushi Torikoshi

> [failed]
>
> ------------------------------------------------------------
> If the requestor wants to cancel the request, it sets the state to
> 'cancel' then signal.
>
> Requestor dumper state
> [idle] initial
> [request] -------------------> cancel
> <if canceled. clean up>
> [dumping]
> <if canceled. clean up>
> <-signal-[done]
>
> -signal-><try to clean up>
>
>
> Other aspects to cnosider?
>
> regards.

In response to

Re: Get memory contexts of an arbitrary backend process at 2020-10-23 04:46:11 from Kyotaro Horiguchi

Responses

Re: Get memory contexts of an arbitrary backend process at 2020-11-10 15:07:13 from Georgios Kokolatos

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andres Freund	2020-10-28 06:39:57	Re: Support for NSS as a libpq TLS backend
Previous Message	Tatsuro Yamada	2020-10-28 06:07:56	Re: list of extended statistics on psql