Re: Proposal: Progressive explain

From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andrei Lepikhov <lepihov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-04-01 13:38:23
Message-ID: 78cff2165abb881244376988893a87b8@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2025-04-01 15:23, Rafael Thofehrn Castro wrote:
> Hello again,
>
>> ERROR: could not attach to dynamic shared area
>
> In addition to that refactoring issue, the current patch had a race
> condition in pg_stat_progress_explain to access the DSA of a process
> running a query that gets aborted.
>
> While discussing with Robert we agreed that it would be wiser to take
> a step back and change the strategy used to share progressive explain
> data in shared memory.
>
> Instead of using per backend's DSAs shared via a hash structure I now
> define a dsa_pointer and a LWLock in each backend's PGPROC.
>
> A global DSA is created by the first backend that attempts to use
> the progressive explain feature. After the DSA is created, subsequent
> uses of the feature will just allocate memory there and reference
> via PGPROC's dsa_pointer.
>
> This solves the race condition reported by Torikoshi and improves
> concurrency performance as now we don't have a global LWLock
> controlling shared memory access, but a per-backend LWLock.
>
> Performed the same tests done by Torikoshi and it looks like we are
> good now. Even with more frequent inspects in pg_stat_progress_explain
> (\watch 0.01).

Thanks for updating the patch!
Have you tested enabling progressive_explain?

When I ran the 'make installcheck' test again setting
progressive_explain to on, there was the same assertion failure:

TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File:
"ruleutils.c", Line: 8802, PID: 116832
postgres: parallel worker for PID 116822
(ExceptionalCondition+0x98)[0xb7311ea8bf80]
postgres: parallel worker for PID 116822 (+0x89de6c)[0xb7311e9ede6c]
postgres: parallel worker for PID 116822 (+0x89eb68)[0xb7311e9eeb68]
postgres: parallel worker for PID 116822 (+0x89e78c)[0xb7311e9ee78c]
postgres: parallel worker for PID 116822 (+0x8a1d10)[0xb7311e9f1d10]
postgres: parallel worker for PID 116822 (+0x89ed80)[0xb7311e9eed80]
postgres: parallel worker for PID 116822 (+0x89e78c)[0xb7311e9ee78c]
postgres: parallel worker for PID 116822 (+0x89f174)[0xb7311e9ef174]
postgres: parallel worker for PID 116822 (+0x89e78c)[0xb7311e9ee78c]
postgres: parallel worker for PID 116822 (+0x89f0b8)[0xb7311e9ef0b8]
postgres: parallel worker for PID 116822 (+0x8928dc)[0xb7311e9e28dc]
postgres: parallel worker for PID 116822
(deparse_expression+0x34)[0xb7311e9e2834]
postgres: parallel worker for PID 116822 (+0x347870)[0xb7311e497870]
postgres: parallel worker for PID 116822 (+0x3478e4)[0xb7311e4978e4]
postgres: parallel worker for PID 116822 (+0x347970)[0xb7311e497970]
...
TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File:
"ruleutils.c", Line: 8802, PID: 116831
[115650] LOG: 00000: background worker "parallel worker" (PID 116831)
was terminated by signal 6: Aborted
[115650] DETAIL: Failed process was running: explain (analyze, costs
off, summary off, timing off, buffers off) select count(*) from ab where
(a = (select 1) or a = (select 3)) and b = 2
[115650] LOCATION: LogChildExit, postmaster.c:2846

We can reproduce it as follows:

show progressive_explain;
progressive_explain
---------------------
on

create table ab (a int not null, b int not null) partition by list
(a);
create table ab_a2 partition of ab for values in(2) partition by list
(b);
create table ab_a2_b1 partition of ab_a2 for values in (1);
create table ab_a2_b2 partition of ab_a2 for values in (2);
create table ab_a2_b3 partition of ab_a2 for values in (3);
create table ab_a1 partition of ab for values in(1) partition by list
(b);
create table ab_a1_b1 partition of ab_a1 for values in (1);
create table ab_a1_b2 partition of ab_a1 for values in (2);
create table ab_a1_b3 partition of ab_a1 for values in (3);
create table ab_a3 partition of ab for values in(3) partition by list
(b);
create table ab_a3_b1 partition of ab_a3 for values in (1);
create table ab_a3_b2 partition of ab_a3 for values in (2);
create table ab_a3_b3 partition of ab_a3 for values in (3);

set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set min_parallel_table_scan_size = 0;
set max_parallel_workers_per_gather = 2;

explain (analyze, costs off, summary off, timing off, buffers off)
select count(*) from ab where (a = (select 1) or a = (select 3)) and b =
2;
WARNING: terminating connection because of crash of another server
process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Note that there is no need to access pg_stat_progress_explain.

Could you please check if you can reproduce this?

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2025-04-01 13:45:23 Re: AIO v2.5
Previous Message Robert Haas 2025-04-01 13:37:30 Re: Statistics Import and Export