From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Proposal: Progressive explain |
Date: | 2025-01-08 01:03:25 |
Message-ID: | 5a9ea19e-7df8-4f44-b1ca-1ab4ff5dde99@vondra.me |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi Rafael,
This sounds like a great feature, thanks for working on it and sharing
the patch. Let me share some comments / questions after a quick review.
On 12/30/24 02:18, Rafael Thofehrn Castro wrote:
> Hello community,
>
> CONTEXT:
>
> Back in October I presented the talk "Debugging active queries with
> mid-flight instrumented explain plans" at PGConf EU 2024
> (recording: https://www.youtube.com/watch?v=6ahTb-7C05c <https://
> www.youtube.com/watch?v=6ahTb-7C05c>) presenting
> an experimental feature that enables visualization of in progress
> EXPLAIN ANALYZE executions. Given the positive feedback and requests,
> I am sending this patch with the feature, which I am calling Progressive
> Explain.
>
Really nice talk, I enjoyed watching it a couple weeks back.
> PROPOSAL:
>
> This proposal introduces a feature to print execution plans of active
> queries in an in-memory shared hash object so that other sessions can
> visualize them with a new view: pg_stat_progress_explain.
>
> Plans are only printed if the new GUC parameter progressive_explain is
> enabled.
>
Aren't the names of the view / GUC are a bit misleading? Because this is
really about EXPLAIN ANALYZE, not just EXPLAIN. Not sure.
> For regular queries or queries started with EXPLAIN (without ANALYZE)
> the plan is printed only once at the start.
>
Initially I thought this is a bit weird, and also inconsistent with the
auto_explain (which prints plans at the end). But I think for "progress"
that makes sense, even if it's not updated after that.
> For instrumented runs (started via EXPLAIN ANALYZE or when auto_explain
> flag log_analyze is enabled), the plan is printed on a fixed interval
> controlled by the new GUC parameter progressive_explain_interval. This plan
> includes all instrumentation stats computed so far (per node rows and
> execution time).
>
OK, I understand why it works like this. But I rather dislike that it
relies on auto_explain to enable the progressive updates. If we want to
make this dependent on auto_explain, then most of this could/should be
moved to auto_explain, and called through a hook. Or something similar.
If we want to add this to core, then I think the progressive_explain GUC
should have "off|explain|analyze" values, or a similar way to enable the
instrumentation and regular updates. But that would probably also need
to duplicate a number of auto_explain options (e.g. to allow disabling
timings).
> New view:
> - pg_stat_progress_explain
> - pid: PID of the process running the query
> - last_explain: timestamp when plan was last printed
> - explain_count: amount of times plan was printed
> - total_explain_time: accumulated time spend printing plans (in ms)
> - explain: the actual plan (limited read privileges)
>
I find the "explain_time" a bit misleading. On the one hand - yes, it
does account for time spent generating the EXPLAIN output. But on the
other hand, that's only a part of the overhead - it ignores the overhead
of the extra instrumentation. Also, wouldn't it be better to rename
"explain" to "query plan"? That's what the EXPLAIN result says too.
> New GUCs:
> - progressive_explain: if progressive plans are printed for local
> session.
> - type: bool
> - default: off
> - context: user
> - progressive_explain_interval: interval between each explain print.
> - type: int
> - default: 1s
> - min: 10ms
> - context: user
Seems reasonable (but I already commented on progressive_explain).
> - progressive_explain_sample_rate: fraction of rows processed by the
> query until progressive_explain_interval is evaluated to print a
> progressive plan
> - type: floating point
> - default: 0.1
> - range: (0.0 - 1.0)
> - context: user
I find this pretty weird / unnecessary. More comments later.
> - progressive_explain_output_size: max output size of the plan
> printed in the in-memory hash table.
> - type: int
> - default: 4096
> - min: 100
> - context: postmaster
Seems far too small.
> - progressive_explain_format: format used to print the plans.
> - type: enum
> - default: text
> - context: user
Good idea. When processing the view automatically, JSON would be much
better than text.
> - progressive_explain_settings: controls whether information about
> modified configuration is added to the printed plan.
> - type: bool
> - default: off
> - context: user
> - progressive_explain_verbose: controls whether verbose details are
> added to the printed plan.
> - type: bool
> - default: off
> - context: user
>
This seems to be the duplication of some auto_explain parameters that I
mentioned above. Except that it only duplicates some of them. I'd like
the ability to disable collecting timing info, which is usually by far
the main overhead.
> DEMONSTRATION:
>
> postgres=# SET progressive_explain = ON;
> SET
> postgres=# EXPLAIN ANALYZE SELECT *
> FROM test t1
> UNION ALL
> SELECT *
> FROM test t1;
>
> postgres=# select * from pg_stat_progress_explain;
> -[ RECORD 1 ]------
> +---------------------------------------------------------------------------------------------------------------------------------------
> pid | 299663
> last_explain | 2024-12-29 22:40:33.016833+00
> explain_count | 5
> total_explain_time | 0.205
> explain | Append (cost=0.00..466670.40 rows=20000160
> width=37) (actual time=0.052..3372.318 rows=14013813 loops=1)
> | Buffers: shared hit=4288 read=112501
> | -> Seq Scan on test t1 (cost=0.00..183334.80
> rows=10000080 width=37) (actual time=0.052..1177.428 rows=10000000 loops=1)
> | Buffers: shared hit=4288 read=79046
> | -> Seq Scan on test t1_1 (cost=0.00..183334.80
> rows=10000080 width=37) (actual time=0.072..608.481 rows=4013813
> loops=1) (current)
> | Buffers: shared read=33455
> |
>
> IMPLEMENTATION HIGHLIGHTS:
>
> - The initial plan is printed at the end of standard_ExecutorStart
> if progressive_explain is enabled, for both regular queries and
> instrumented ones (EXPLAIN ANALYZE):
>
> /*
> * Start progressive explain if enabled.
> */
> if (progressive_explain)
> ProgressiveExplainBegin(queryDesc);
>
> - The incremental plan print for instrumented runs uses a dedicated
> ExecProcNode if progressive_explain is enabled:
>
> if (node->instrument)
> if (progressive_explain)
> node->ExecProcNode = ExecProcNodeInstrExplain;
> else
> node->ExecProcNode = ExecProcNodeInstr;
> else
> node->ExecProcNode = node->ExecProcNodeReal;
>
> - ExecProcNodeInstrExplain is identical to ExecProcNodeInstr with an
> additional part to print plans based on a sampling logic:
>
I haven't looked at the details yet, but this seems mostly reasonable.
> /*
> * Update progressive explain based on sampling.
> */
> if (pg_prng_double(&pg_global_prng_state) < progressive_explain_sample_rate)
> ProgressiveExplainUpdate(node);
>
> That logic was added because ExecProcNodeInstrExplain is called once per
> row processed (a lot of times) and performing the timestamp interval
> check with progressive_explain_interval to decide whether to print
> the plan (done inside ProgressiveExplainUpdate) is expensive. Benchmarks
> (shared at the end of this email) show that sampling + timestamp check
> gives much better results than performing the timestamp check at every
> ExecProcNodeInstrExplain call.
>
I don't think this is the way to deal with the high cost of collecting
timing information. It just adds unpredictability, because how would you
know what's a good sample rate? What if you pick a far too low value,
and then a plan gets logged much less frequently because for that
particular query it takes much longer to return a tuple?
I think the right solution to deal with high cost of timing are
timeouts. Set a timeout, with a handler that sets a "pending" flag, and
check the flag before calling ProgressiveExplainUpdate(). See for
example TransactionTimeoutHandler in postinit.c.
> - The plans are stored in a shared hash object (explainArray) allocated
> at database start, similar to procArray. ExplainHashShmemSize() computes
> shared memory needed for it, based on max_connections + max_parallel_workers
> for the amount of elements in the array and progressive_explain_output_size
> for the size per element.
>
Why do we need a shared hash table?
> - A memory context release callback is configured in the memory context
> where the query is running, being responsible for updating explainArray
> even when the query doesn't finish gracefully.
>
OK
> - Instrumented plans being printed incrementally need to clone
> instrumentation
> objects to change them, so each print uses a dedicated memory context
> that gets released after the output is constructed. This avoids extended
> private memory usage:
>
Yeah. I was wondering how you're going to deal with this. Wouldn't it be
cheaper to just use a static variable? I don't think this is called
recursively, and that'd save the palloc/pfree. Haven't tried and not
sure if it's worth it.
>
> /* Dedicated memory context for the current plan being printed */
> tmpCxt = AllocSetContextCreate(CurrentMemoryContext,
> "Progressive Explain Temporary Context",
> ALLOCSET_DEFAULT_SIZES);
>
Maybe it'd be better to keep this memory context for the query duration
(e.g. by adding it to queryDesc), and just reset it before the calls?
That'd cache the memory, and it shouldn't really use a lot of it, no?
Also, ProgressiveExplainPrint() does this at the end:
+ /*
+ * Free local explain state before exiting as this function may be
+ * called multiple times in the same memory context.
+ */
+ pfree(es->str);
+ pfree(es);
Isn't this actually pointless with the local memory context?
> - A new version of InstrEndLoop (InstrEndLoopForce) was created that allows
> to be called targeting in-progress instrumented objects. Those are common
> when traversing the plan tree of an active query.
>
No opinion yet.
> - Column explain from pg_stat_progress_explain can only be visualized by
> superusers or the same role that is running the query. If none of those
> conditions are met, users will see "<insufficient privilege>".
>
I think this is in line with how we restrict access to similar catalogs.
> - For instrumented runs, the printed includes 2 per node modifiers when
> appropriate:
>
> <current>: the plan node currently being processed.
> <never executed>: a plan node not processed yet.
>
Not sure the <current> label is all that useful. It seems misleading at
best, because it's simply the last node that generated the explain. But
we might have already moved to a different node. And that node may be
stuck / very expensive, yet we'll see the plan as seemingly waiting in
some other node.
> IMPLEMENTATION OVERHEAD:
>
> When not used, the overhead added is:
>
> - One IF at standard_ExecutorStart to check if progressive_explain is
> enabled
> - For instrumented runs (EXPLAIN ANALYZE), one IF at ExecProcNodeFirst
> to define ExecProcNode wrapper
>
> BENCHMARKS:
>
> Performed 3 scenarios of benchmarks:
>
> A) Comparison between unpatched PG18, patched with progressive explain
> disabled and patched with feature enabled globally (all queries printing
> the plan at query start:
>
> - PG18 without patch:
>
> postgres(at)ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -S -n -T 120 -c 30
> number of transactions actually processed: 2173978
> latency average = 1.655 ms
> tps = 18127.977529 (without initial connection time)
>
> - PG18 with patch:
>
> -- progressive_explain = off
>
> postgres(at)ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -S -n -T 120 -c 30
> number of transactions actually processed: 2198806
> latency average = 1.636 ms
> tps = 18333.157809 (without initial connection time)
>
> -- progressive_explain = on (prints plan only once per query)
>
> postgres(at)ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -S -n -T 120 -c 30
> number of transactions actually processed: 2047459
> latency average = 1.756 ms
> tps = 17081.477199 (without initial connection time)
>
> B) EXPLAIN ANALYZE performance with different progressive_explain_interval
> settings in patched:
>
> -- progressive_explain = off
>
> postgres(at)ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1
> -f script.sql
> number of transactions actually processed: 27
> latency average = 4492.845 ms
>
> -- progressive_explain = on
> -- progressive_explain_interval = 1s (default)
> -- progressive_explain_sample_rate = 0.01 (default)
>
> postgres(at)ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 300 -c 1
> -f script.sql
> number of transactions actually processed: 26
> latency average = 4656.067 ms
>
> -- progressive_explain = on
> -- progressive_explain_interval = 10ms
> -- progressive_explain_sample_rate = 0.01 (default)
>
> postgres(at)ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 300 -c 1
> -f script.sql
> number of transactions actually processed: 26
> latency average = 4785.608 ms
>
> C) EXPLAIN ANALYZE performance in patched with and without
> progressive_explain_sample_rate, ie, sampling with 2 different values
> and also no sampling logic:
>
> -- progressive_explain = off
>
> postgres(at)ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1
> -f script.sql
> number of transactions actually processed: 27
> latency average = 4492.845 ms
>
> -- progressive_explain = on
> -- progressive_explain_interval = 1s (default)
> -- progressive_explain_sample_rate = 0.01 (default)
>
> postgres(at)ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1
> -f script.sql
> number of transactions actually processed: 26
> latency average = 4656.067 ms
>
> -- progressive_explain = on
> -- progressive_explain_interval = 1s (default)
> -- progressive_explain_sample_rate = 1
>
> postgres(at)ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1
> -f script.sql
> number of transactions actually processed: 19
> latency average = 6432.902 ms
>
> -- progressive_explain = on
> -- progressive_explain_interval = 1s (default)
> -- NO SAMPLING LOGIC
>
> postgres(at)ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1
> -f script.sql
> number of transactions actually processed: 21
> latency average = 5864.820 ms
>
> BENCHMARK RESULTS:
>
> It definitely needs more testing but preliminary results show that:
>
> - From (A) we see that the patch adds negligible overhead when the
> feature is not used. Enabling globally reduces overall TPS as all
> queries are spending time printing the plan. The idea is to enable
> progressive_explain on a per-need basis, only to a subset of sessions
> that need it.
>
> - From (B) we see that using progressive explain slightly increases
> total execution time. Difference between using progressive_explain_interval
> set to 1s (plan printed 4 times per query in the test) and to
> 10ms (plan printed ~460 times per query in the test) is very small.
> The actual overhead appears when changing progressive_explain_sample_rate.
>
> - From (C) we see that progressive_explain_sample_rate with a low
> value (default 0.01) performs better than not using sampling or
> using progressive_explain_sample_rate = 1. So the overhead of having
> the sampling logic is much lower than not sampling at all.
>
No opinion. I need to do some testing / benchmarking myself.
> TESTS:
>
> Currently working on tests for a second version of the patch.
>
> DOCUMENTATION:
>
> Added documentation for the new view pg_stat_progress_explain,
> new GUCs and a new item in section 14.1:
>
> 14.1. Using EXPLAIN
> 14.1.1. EXPLAIN Basics
> 14.1.2. EXPLAIN ANALYZE
> 14.1.3. Caveats
> 14.1.4. Progressive EXPLAIN
>
> FURTHER DISCUSSION:
>
> Considering that this patch introduces a new major feature with
> several new components (view, GUCs, etc), there is open room for
> discussion such as:
>
> - Do the columns in pg_stat_progress_explain make sense? Are we
> missing or adding unnecessary columns?
>
> - Do the new GUCs make sense and are their default values appropriate?
>
> - Do we want progressive explain to print plans of regular queries
> started without EXPLAIN if progressive_explain is enabled or should
> the feature be restricted to instrumented queries (EXPLAIN ANALYZE)?
>
> - Is the size of explainHash based on max_connections + max_parallel_workers
> large enough or are there other types of backends that use the
> executor and will print plans too?
>
I've commented on some of these items earlier.
regards
--
Tomas Vondra
From | Date | Subject | |
---|---|---|---|
Next Message | Andres Freund | 2025-01-08 02:10:50 | Reorder shutdown sequence, to flush pgstats later |
Previous Message | John Naylor | 2025-01-08 00:29:46 | Sort functions with specialized comparators |