Re: Enhancing Memory Context Statistics Reporting

From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enhancing Memory Context Statistics Reporting
Date: 2024-12-03 19:09:02
Message-ID: CAH2L28spk4c54k5pTSWRqu5LJmetQwTXtr=-F7Bnioxd425hqQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

>
>
> > 4) I wonder if the function needs to return PID. I mean, the caller
> > knows which PID it is for, so it seems rather unnecessary.
> >
> > Perhaps it can be used to ascertain that the information indeed belongs
> to
> > the requested pid.
> >
>
> I find that a bit ... suspicious. By this logic we'd include the input
> parameters in every result, but we don't. So why is this case different?
>
>
This was added to address a review suggestion. I had left it in case anyone
found it useful
for verification.
Previously, I included a check for scenarios where multiple processes could
write to the same
shared memory. Now, each process has a separate shared memory space
identified by
pgprocno, making it highly unlikely for the receiving process to see
another process's memory
dump.
Such a situation could theoretically occur if another process were mapped
to the same
pgprocno, although I’m not sure how likely that is. That said, I’ve added a
check in the receiver
to ensure the PID written in the shared memory matches the PID for which
the dump is
requested.
This guarantees that a user will never see the memory dump of another
process.
Given this, I’m fine with removing the pid column if it helps to make the
output more readable.

> 5) In the "summary" mode, it might be useful to include info about how
> > many child contexts were aggregated. It's useful to know whether
> there
> > was 1 child or 10000 children. In the regular (non-summary) mode it'd
> > always be "1", probably, but maybe it'd interact with the limit in
> (1).
> > Not sure.
> >
> > Sure, I will add this in the next iteration.
> >
>
> OK
>

I have added this information as a column named "num_agg_contexts", which
indicates
the number of contexts whose statistics have been aggregated/added for a
particular output.

In summary mode, all the child contexts of a given level-1 context are
aggregated, and
their statistics are presented as part of the parent context's statistics.
In this case,
num_agg_contexts provides the count of all child contexts under a given
level-1 context.

In regular (non-summary) mode, this column shows a value of 1, meaning the
statistics
correspond to a single context, with all context statistics displayed
individually. In this mode
an aggregate result is displayed if the number of contexts exceed the DSA
size limit. In
this case the num_agg_contexts will display the number of the remaining
contexts.

>
> > In the patch, since the timeout was set to a high value, pgbench ended
> > up stuck
> > waiting for the timeout to occur. The failure happens less frequently
> > after I added an
> > additional check for the process's existence, but it cannot be entirely
> > avoided. This is because a process can terminate after we check for its
> > existence but
> > before it signals the client. In such cases, the client will not receive
> > any signal.
> >
>
> Hmmm, I see. I guess there's no way to know if a process responds to us,
> but I guess it should be possible to wake up regularly and check if the
> process still exists? Wouldn't that solve the case you mentioned?
>
> I have fixed it accordingly in the attached patch by waking up after every
5 seconds
to check if the process exists and sleeping again if the wake-up condition
is not satisfied. The number of such tries is limited to 20. So, the total
wait
time can be 100 seconds. I will make the re-tries configurable, inline with
your
suggestion to be able to override the default waiting time.

> > I wonder if e.g. autovacuum launcher may
> > not be handling these requests, or what if client backends can wait
> in a
> > cycle.
> >
> >
> > Did not see a cyclic wait in client backends due to the pgbench stress
> test.
> >
>
> Not sure, but if I modify the query to only request memory contexts from
> non-client processes, i.e.
>
> SELECT * FROM pg_get_process_memory_contexts(
> (SELECT pid FROM pg_stat_activity
> WHERE pid != pg_backend_pid()
> AND backend_type != 'client backend'
> ORDER BY random() LIMIT 1)
> , false);
>
> then it gets stuck and reports this:
>
> pgbench -n -f select.sql -c 4 -T 10 test
> pgbench (18devel)
> WARNING: Wait for 105029 process to publish stats timed out, ...
>
> But process 105029 still very much exists, and it's the checkpointer:
>
> In the case of checkpointer, I also see some wait time after running the
tests that you mentioned, but it eventually completes the request in my
runs.

> $ ps ax | grep 105029
> 105029 ? Ss 0:00 postgres: checkpointer
>
> OTOH if I modify the script to only look at client backends, and wait
> until the processes get "stuck" (i.e. waiting on the condition variable,
> consuming 0% CPU), I get this:
>
> $ pgbench -n -f select.sql -c 4 -T 10 test
> pgbench (18devel)
> WARNING: Wait for 107146 process to publish stats timed out, try again
> WARNING: Wait for 107144 process to publish stats timed out, try again
> WARNING: Wait for 107147 process to publish stats timed out, try again
> transaction type: select.sql
> ...
>
> but when it gets 'stuck', most of the processes are still very much
> running (but waiting for contexts from some other process). In the above
> example I see this:
>
> 107144 ? Ss 0:02 postgres: user test [local] SELECT
> 107145 ? Ss 0:01 postgres: user test [local] SELECT
> 107147 ? Ss 0:02 postgres: user test [local] SELECT
>
> So yes, 107146 seems to be gone. But why would that block getting info
> from 107144 and 107147?
>
> Most likely 107144 and/or 107147 must also be waiting for 107146 which is
gone. Something like 107144 -> 107147 -> 107146(dead) or 107144
->107146(dead)
and 107147->107146(dead).

Maybe that's acceptable, but couldn't this be an issue with short-lived
> connections, making it hard to implement the kind of automated
> collection of stats that you envision. If it hits this kind of timeouts
> often, it'll make it hard to reliably collect info. No?

Yes, if there is a chain of waiting clients due to a process no longer
existing,
the waiting time to receive information will increase. However, as long as
a failed
a request caused by a non-existent process is detected promptly, the wait
time should
remain manageable, allowing other waiting clients to obtain the requested
information
from the existing processes.

In such cases, it might be necessary to experiment with the waiting times
at the receiving
client. Making the waiting time user-configurable, as you suggested, by
passing it as an
argument to the function, could help address this scenario.
Thanks for highlighting this, I will test this some more.

> >
> > > I opted for DSAs over DSMs to enable memory reuse by freeing
> > > segments for subsequent statistics copies of the same backend,
> > > without needing to recreate DSMs for each request.
> >
> > I feel like this might be a premature optimization - I don't have a
> > clear idea how expensive it is to create DSM per request, but my
> > intuition is that it's cheaper than processing the contexts and
> > generating the info.
> >
> > I'd just remove that, unless someone demonstrates it really matters.
> I
> > don't really worry about how expensive it is to process a request
> > (within reason, of course) - it will happen only very rarely. It's
> more
> > important to make sure there's no overhead when no one asks the
> backend
> > for memory context info, and simplicity.
> >
> > Also, how expensive it is to just keep the DSA "just in case"?
> Imagine
> > someone asks for the memory context info once - isn't it a was to
> still
> > keep the DSA? I don't recall how much resources could that be.
> >
> > I don't have a clear opinion on that, I'm more asking for opinions.
> >
> >
> > Imagining a tool that periodically queries the backends for statistics,
> > it would be beneficial to avoid recreating the DSAs for each call.
>
> I think it would be nice if you backed this with some numbers. I mean,
> how expensive is it to create/destroy the DSA? How does it compare to
> the other stuff this function needs to do?
>
> After instrumenting the code with timestamps, I observed that DSA creation
accounts for approximately 17% to 26% of the total execution time of the
function
pg_get_process_memory_contexts().

> Currently, DSAs of size 1MB per process
> > (i.e., a maximum of 1MB * (MaxBackends + auxiliary processes))
> > would be created and pinned for subsequent reporting. This size does
> > not seem excessively high, even for approx 100 backends and
> > auxiliary processes.
> >
>
> That seems like a pretty substantial amount of memory reserved for each
> connection. IMHO the benefits would have to be pretty significant to
> justify this, especially considering it's kept "forever", even if you
> run the function only once per day.
>
> I can reduce the initial segment size to DSA_MIN_SEGMENT_SIZE, which is
256KB per process. If needed, this could grow up to 16MB based on the
current settings.

However, for the scenario you mentioned, it would be ideal to have a
mechanism
to mark a pinned DSA (using dsa_pin()) for deletion if it is not
used/attached within a
specified duration. Alternatively, I could avoid using dsa_pin()
altogether, allowing the
DSA to be automatically destroyed once all processes detach from it, and
recreate it
for a new request.

At the moment, I am unsure which approach is most feasible. Any suggestions
would be
greatly appreciated.

Thank you,
Rahila Syed

Attachment Content-Type Size
v7-Function-to-report-memory-context-stats-of-any-backe.patch application/octet-stream 41.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2024-12-03 19:19:29 Re: simplify regular expression locale global variables
Previous Message Nikita Malakhov 2024-12-03 18:24:37 Precheck to consider path costs for partial paths