| From: | torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> | 
|---|---|
| To: | James Coleman <jtc331(at)gmail(dot)com> | 
| Cc: | Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Greg Stark <stark(at)mit(dot)edu>, Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>, david(dot)christensen(at)crunchydata(dot)com, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> | 
| Subject: | Re: RFC: Logging plan of the running query | 
| Date: | 2023-09-07 06:09:12 | 
| Message-ID: | cbe6eb783affb3e9b5068f5cf019ee62@oss.nttdata.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
On 2023-09-06 11:17, James Coleman wrote:
>> > I've never been able to reproduce it (haven't tested the new version,
>> > but v28 at least) on my M1 Mac; where I've reproduced it is on Debian
>> > (first buster and now bullseye).
>> >
>> > I'm attaching several stacktraces in the hope that they provide some
>> > clues. These all match the ps output I sent earlier, though note in
>> > that output there is both the regress instance and my test instance
>> > (pid 3213249) running (different ports, of course, and they are from
>> > the exact same compilation run). I've attached ps output for the
>> > postgres processes under the make check process to simplify cross
>> > referencing.
>> >
>> > A few interesting things:
>> > - There's definitely a lock on a relation that seems to be what's
>> > blocking the processes.
>> > - When I try to connect with psql the process forks but then hangs
>> > (see the ps output with task names stuck in "authentication"). I've
>> > also included a trace from one of these.
>> 
>> Thanks for sharing them!
>> 
>> Many processes are waiting to acquire the LW lock, including the 
>> process
>> trying to output the plan(select1.trace).
>> 
>> I suspect that this is due to a lock that was acquired prior to being
>> interrupted by ProcessLogQueryPlanInterrupt(), but have not been able 
>> to
>> reproduce the same situation..
>> 
> 
> I don't have time immediately to dig into this, but perhaps loading up
> the core dumps would allow us to see what query is running in each
> backend process (if it hasn't already been discarded by that point)
> and thereby determine what point in each test process led to the error
> condition?
Thanks for the suggestion.
I am concerned that core dumps may not be readable on different 
operating systems or other environments. (Unfortunately, I do not have 
Debian on hand)
It seems that we can know what queries were running from the stack 
traces you shared.
As described above, I suspect a lock which was acquired prior to 
ProcessLogQueryPlanInterrupt() caused the issue.
I will try a little more to see if I can devise a way to create the same 
situation.
> Alternatively we might be able to apply the same trick to the test
> client instead...
> 
> BTW, for my own easy reference in this thread: relid 1259 is pg_class
> if I'm not mistaken.
Yeah, and I think it's strange that the lock to 1259 appears twice and 
must be avoided.
   #10 0x0000559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1) at 
lmgr.c:117
   ..
   #49 0x0000559d61b4989d in ProcessLogQueryPlanInterrupt () at 
explain.c:5158
   ..
   #53 0x0000559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1) at 
lmgr.c:117
-- 
Regards,
--
Atsushi Torikoshi
NTT DATA Group Corporation
| From | Date | Subject | |
|---|---|---|---|
| Next Message | John Naylor | 2023-09-07 06:14:02 | Re: Improve heapgetpage() performance, overhead from serializable | 
| Previous Message | Yugo NAGATA | 2023-09-07 06:02:49 | Re: psql help message contains excessive indentations |