From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enhancing Memory Context Statistics Reporting
Date: 2024-12-03 21:09:11
Message-ID: 3b07073e-3fab-4f5c-8f22-94ca0dd07cbd@vondra.me
Lists: pgsql-hackers
On 12/3/24 20:09, Rahila Syed wrote:
>
> Hi,
>
> > 4) I wonder if the function needs to return PID. I mean, the caller
> > knows which PID it is for, so it seems rather unnecessary.
> >
> > Perhaps it can be used to ascertain that the information indeed
> > belongs to the requested pid.
> >
>
> I find that a bit ... suspicious. By this logic we'd include the input
> parameters in every result, but we don't. So why is this case different?
>
>
> This was added to address a review suggestion. I had left it in, in
> case anyone found it useful for verification.
> Previously, I included a check for scenarios where multiple processes
> could write to the same shared memory. Now, each process has a
> separate shared memory space identified by pgprocno, making it highly
> unlikely for the receiving process to see another process's memory
> dump.
> Such a situation could theoretically occur if another process were
> mapped to the same pgprocno, although I’m not sure how likely that is.
> That said, I’ve added a check in the receiver to ensure the PID
> written in the shared memory matches the PID for which the dump is
> requested. This guarantees that a user will never see the memory dump
> of another process.
> Given this, I’m fine with removing the pid column if it helps to make
> the output more readable.
>
I'd just remove that. I agree it might have been useful with the single
chunk of shared memory, but with separate chunks it's not very useful.
And if we can end up with multiple processes getting the same pgprocno,
I guess we have way bigger problems - this check won't fix that.
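For reference, the receiver-side PID check described above might look
roughly like this (a minimal sketch only - the slot layout and the
MemCtxSlot/memctx_area names are hypothetical, not from the patch):

    /* verify the dump in the per-pgprocno slot belongs to the PID we asked about */
    MemCtxSlot *slot = &memctx_area->slots[procno];

    if (slot->pid != target_pid)
        ereport(ERROR,
                (errmsg("memory context dump in slot %d belongs to PID %d, expected %d",
                        procno, slot->pid, target_pid)));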
> > 5) In the "summary" mode, it might be useful to include info about
> > how many child contexts were aggregated. It's useful to know whether
> > there was 1 child or 10000 children. In the regular (non-summary)
> > mode it'd always be "1", probably, but maybe it'd interact with the
> > limit in (1). Not sure.
> >
> > Sure, I will add this in the next iteration.
> >
> >
>
> OK
>
>
> I have added this information as a column named "num_agg_contexts",
> which indicates the number of contexts whose statistics have been
> aggregated/added for a particular output row.
>
> In summary mode, all the child contexts of a given level-1 context are
> aggregated, and their statistics are presented as part of the parent
> context's statistics. In this case, num_agg_contexts provides the
> count of all child contexts under a given level-1 context.
>
> In regular (non-summary) mode, this column shows a value of 1, meaning
> the statistics correspond to a single context, with all context
> statistics displayed individually. In this mode an aggregate result is
> displayed only if the number of contexts exceeds the DSA size limit,
> in which case num_agg_contexts shows the number of the remaining
> contexts.
>
OK
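FWIW I imagine the counting works roughly like this (a sketch only,
assuming the MemoryContext stats callback and the firstchild/nextchild
links from memnodes.h in current sources; the function itself is
hypothetical, not the patch's actual code):

    static void
    aggregate_context_stats(MemoryContext ctx, MemoryContextCounters *totals,
                            int *num_agg_contexts)
    {
        MemoryContext child;

        /* fold this context's counters into the running totals */
        ctx->methods->stats(ctx, NULL, NULL, totals, false);
        (*num_agg_contexts)++;

        /* recurse into all children, so a level-1 entry covers its subtree */
        for (child = ctx->firstchild; child != NULL; child = child->nextchild)
            aggregate_context_stats(child, totals, num_agg_contexts);
    }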
> >
> > In the patch, since the timeout was set to a high value, pgbench
> > ended up stuck waiting for the timeout to occur. The failure happens
> > less frequently after I added an additional check for the process's
> > existence, but it cannot be entirely avoided. This is because a
> > process can terminate after we check for its existence but before it
> > signals the client. In such cases, the client will not receive any
> > signal.
> >
>
> Hmmm, I see. I guess there's no way to know if a process responds to us,
> but I guess it should be possible to wake up regularly and check if the
> process still exists? Wouldn't that solve the case you mentioned?
>
> I have fixed it accordingly in the attached patch by waking up every
> 5 seconds to check if the process exists, and sleeping again if the
> wake-up condition is not satisfied. The number of such tries is
> limited to 20, so the total wait time can be 100 seconds. I will make
> the retries configurable, in line with your suggestion to be able to
> override the default waiting time.
>
Makes sense, although 100 seconds seems a bit weird - we usually pick
"natural" values like 60s, or multiples of that. But if it's
configurable, that's not a huge issue.

Could the process wake up earlier than the timeout, say if it gets
interrupted by a signal (EINTR)? That'd break the "total timeout is
100 seconds" promise, and it would be better to check for that
explicitly. Not sure if this can happen, though.

One thing I'd maybe consider is starting with a short timeout, and
gradually increasing it up to e.g. 5 seconds (or maybe just 1 second
would be perfectly fine, IMHO). With the current coding we either get
the response right away, or wait 5+ seconds. That's a huge jump. If we
start e.g. with 10ms, and then gradually multiply it by 1.2, we only
wait "0-20% extra" on average.

But perhaps this is very unlikely and not worth the complexity.
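To make that concrete, the wait loop might look roughly like this (a
sketch only, using the existing ConditionVariableTimedSleep() API; the
stats_ready flag, memCtxState and the wait event name are hypothetical):

    long    timeout_ms = 10;    /* start small */

    ConditionVariablePrepareToSleep(&memCtxState->cv);
    while (!memCtxState->stats_ready)
    {
        if (ConditionVariableTimedSleep(&memCtxState->cv, timeout_ms,
                                        WAIT_EVENT_MEM_CTX_PUBLISH))
        {
            /* timed out - give up if the target process no longer exists */
            if (kill(pid, 0) != 0)
                break;
            /* otherwise grow the timeout geometrically, capped at 1s */
            timeout_ms = Min((long) (timeout_ms * 1.2), 1000);
        }
    }
    ConditionVariableCancelSleep();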
>
> > I wonder if e.g. autovacuum launcher may not be handling these
> > requests, or what if client backends can wait in a cycle.
> >
> > Did not see a cyclic wait in client backends due to the pgbench
> > stress test.
> >
>
> Not sure, but if I modify the query to only request memory contexts from
> non-client processes, i.e.
>
> SELECT * FROM pg_get_process_memory_contexts(
>     (SELECT pid FROM pg_stat_activity
>       WHERE pid != pg_backend_pid()
>         AND backend_type != 'client backend'
>       ORDER BY random() LIMIT 1)
>     , false);
>
> then it gets stuck and reports this:
>
> pgbench -n -f select.sql -c 4 -T 10 test
> pgbench (18devel)
> WARNING: Wait for 105029 process to publish stats timed out, ...
>
> But process 105029 still very much exists, and it's the checkpointer:
>
> In the case of checkpointer, I also see some wait time after running the
> tests that you mentioned, but it eventually completes the request in my
> runs.
>
OK, but why should it even wait that long? Surely the checkpointer
should be able to report memory contexts too?
>
> $ ps ax | grep 105029
> 105029 ? Ss 0:00 postgres: checkpointer
>
> OTOH if I modify the script to only look at client backends, and wait
> until the processes get "stuck" (i.e. waiting on the condition variable,
> consuming 0% CPU), I get this:
>
> $ pgbench -n -f select.sql -c 4 -T 10 test
> pgbench (18devel)
> WARNING: Wait for 107146 process to publish stats timed out, try again
> WARNING: Wait for 107144 process to publish stats timed out, try again
> WARNING: Wait for 107147 process to publish stats timed out, try again
> transaction type: select.sql
> ...
>
> but when it gets 'stuck', most of the processes are still very much
> running (but waiting for contexts from some other process). In the above
> example I see this:
>
> 107144 ? Ss 0:02 postgres: user test [local] SELECT
> 107145 ? Ss 0:01 postgres: user test [local] SELECT
> 107147 ? Ss 0:02 postgres: user test [local] SELECT
>
> So yes, 107146 seems to be gone. But why would that block getting info
> from 107144 and 107147?
>
> Most likely 107144 and/or 107147 must also be waiting for 107146,
> which is gone. Something like 107144 -> 107147 -> 107146 (dead), or
> 107144 -> 107146 (dead) and 107147 -> 107146 (dead).
>
I think I forgot to mention only 107145 was waiting for 107146 (dead),
and the other processes were waiting for 107145 in some way. But yeah,
detecting the dead process would improve this, although it also shows
the issues can "spread" easily.

OTOH it's unlikely to have multiple pg_get_process_memory_contexts()
queries pointing at each other like this - monitoring will just do that
from one backend, and that's it. So not a huge issue.
>
> Maybe that's acceptable, but couldn't this be an issue with short-lived
> connections, making it hard to implement the kind of automated
> collection of stats that you envision? If it hits this kind of timeout
> often, it'll make it hard to reliably collect info. No?
>
>
> Yes, if there is a chain of waiting clients due to a process no longer
> existing, the waiting time to receive information will increase.
> However, as long as a failed request caused by a non-existent process
> is detected promptly, the wait time should remain manageable, allowing
> other waiting clients to obtain the requested information from the
> existing processes.
>
> In such cases, it might be necessary to experiment with the waiting
> times at the receiving client. Making the waiting time
> user-configurable, as you suggested, by passing it as an argument to
> the function, could help address this scenario.
> Thanks for highlighting this, I will test this some more.
>
I think we should try very hard to make this work well without the user
having to mess with the timeouts. These are exceptional conditions that
happen only very rarely, which makes it hard to find good values.
>
> >
> > > I opted for DSAs over DSMs to enable memory reuse by freeing
> > > segments for subsequent statistics copies of the same backend,
> > > without needing to recreate DSMs for each request.
> >
> > I feel like this might be a premature optimization - I don't have a
> > clear idea how expensive it is to create a DSM per request, but my
> > intuition is that it's cheaper than processing the contexts and
> > generating the info.
> >
> > I'd just remove that, unless someone demonstrates it really matters.
> > I don't really worry about how expensive it is to process a request
> > (within reason, of course) - it will happen only very rarely. It's
> > more important to make sure there's no overhead when no one asks the
> > backend for memory context info, and simplicity.
> >
> > Also, how expensive is it to just keep the DSA "just in case"?
> > Imagine someone asks for the memory context info once - isn't it a
> > waste to still keep the DSA? I don't recall how much resources that
> > could be.
> >
> > I don't have a clear opinion on that, I'm more asking for opinions.
> >
> > Imagining a tool that periodically queries the backends for
> > statistics, it would be beneficial to avoid recreating the DSAs for
> > each call.
>
> I think it would be nice if you backed this with some numbers. I mean,
> how expensive is it to create/destroy the DSA? How does it compare to
> the other stuff this function needs to do?
>
> After instrumenting the code with timestamps, I observed that DSA
> creation accounts for approximately 17% to 26% of the total execution
> time of the function pg_get_process_memory_contexts().
>
>
> > Currently, DSAs of size 1MB per process (i.e., a maximum of
> > 1MB * (MaxBackends + auxiliary processes)) would be created and
> > pinned for subsequent reporting. This size does not seem excessively
> > high, even for approx 100 backends and auxiliary processes.
> >
>
> That seems like a pretty substantial amount of memory reserved for
> each connection. IMHO the benefits would have to be pretty significant
> to justify this, especially considering it's kept "forever", even if
> you run the function only once per day.
>
> I can reduce the initial segment size to DSA_MIN_SEGMENT_SIZE, which
> is 256KB per process. If needed, this could grow up to 16MB based on
> the current settings.
>
> However, for the scenario you mentioned, it would be ideal to have a
> mechanism to mark a pinned DSA (using dsa_pin()) for deletion if it is
> not used/attached within a specified duration. Alternatively, I could
> avoid using dsa_pin() altogether, allowing the DSA to be automatically
> destroyed once all processes detach from it, and recreate it for a new
> request.
>
> At the moment, I am unsure which approach is most feasible. Any
> suggestions would be greatly appreciated.
>
I'm entirely unconcerned about the pg_get_process_memory_contexts()
performance, within some reasonable limits. It's something executed
only every now and then - no one is going to complain it takes 10ms
extra, measure tps with this function, etc.

17-26% seems surprisingly high. But even 256kB is too much, IMHO. I'd
just get rid of this optimization until someone complains and explains
why it's worth it.

Yes, let's make it fast, but I don't think we should optimize it at the
expense of the "regular workload" ...
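If we drop the pinning, the per-request lifecycle would be roughly this
(a sketch only - dsa_create()/dsa_attach()/dsa_detach() are the real
dsa.h API, but the tranche id and the stats layout are illustrative):

    /* publishing backend: create an unpinned area just for this request */
    dsa_area   *area = dsa_create(memctx_tranche_id);
    dsa_handle  handle = dsa_get_handle(area);
    dsa_pointer chunk = dsa_allocate(area, stats_size);

    /* ... fill dsa_get_address(area, chunk) with the context stats,
     * publish handle/chunk in shared memory, signal the requester ... */

    /* without dsa_pin() the area is destroyed at the last detach, so
     * the requester must attach before the publisher detaches */
    dsa_detach(area);

    /* requesting backend: attach, read, detach */
    dsa_area   *reader = dsa_attach(handle);
    /* ... read stats via dsa_get_address(reader, chunk) ... */
    dsa_detach(reader);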
regards
--
Tomas Vondra