RE: FW: query pg_stat_ssl hang 100%cpu

From: "James Pang (chaolpan)" <chaolpan(at)cisco(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: RE: FW: query pg_stat_ssl hang 100%cpu
Date: 2023-09-07 08:54:09
Message-ID: PH0PR11MB5191FAF4D1FF00F918018F5ED6EEA@PH0PR11MB5191.namprd11.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-performance

Yes, this backend has been always on same call stack tens of hours and 100% cpu there. It's still hang there now, but I can not reproduce that in other similar environment. I found this query start a transaction "xact_start" from " 2023-09-03 02:36:23" from pg_stat_activity, no idea why.

-----Original Message-----
From: Michael Paquier <michael(at)paquier(dot)xyz>
Sent: Thursday, September 7, 2023 12:05 PM
To: James Pang (chaolpan) <chaolpan(at)cisco(dot)com>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: FW: query pg_stat_ssl hang 100%cpu

On Thu, Sep 07, 2023 at 01:35:00AM +0000, James Pang (chaolpan) wrote:
> PGv14.8, OS RHEL8, no SSL enabled in this database, we have a
> lot of client sessions who check it's ssl state by query, all
> other sessions got done very quickly, but only 1 session hang
> there in 100% cpu tens of hours, even pg_terminate_backend does
> not make it stopped either. It looks like abnormal.
>
> select ssl from pg_stat_ssl where pid=pg_backend_pid();

This is hard to act on without more details or even a reproducible and self-contained test case. Even a java script based on the JDBC driver would be OK for me, for example, if it helps digging into what you are seeing.

> #0 ensure_record_cache_typmod_slot_exists (typmod=0) at
> typcache.c:1714
> #1 0x000000000091185b in assign_record_type_typmod
> (tupDesc=<optimized out>, tupDesc(at)entry=0x27bc738) at typcache.c:2001
> #2 0x000000000091df03 in internal_get_result_type (funcid=<optimized out>, call_expr=<optimized out>, rsinfo=<optimized out>,
> resultTypeId=<optimized out>, resultTupleDesc=0x7ffc9dff8cd0) at
> funcapi.c:393
> #3 0x000000000091e263 in get_expr_result_type (expr=expr(at)entry=0x2792798, resultTypeId=resultTypeId(at)entry=0x7ffc9dff8ccc,
> resultTupleDesc=resultTupleDesc(at)entry=0x7ffc9dff8cd0) at
> funcapi.c:230
> #4 0x00000000006a2fa5 in ExecInitFunctionScan
> (node=node(at)entry=0x273afa8, estate=estate(at)entry=0x269e948,
> eflags=eflags(at)entry=16) at nodeFunctionscan.c:370
> #5 0x000000000069084e in ExecInitNode (node=node(at)entry=0x273afa8,
> estate=estate(at)entry=0x269e948, eflags=eflags(at)entry=16) at
> execProcnode.c:255
> #6 0x000000000068a96d in InitPlan (eflags=16, queryDesc=0x273b2d8) at
> execMain.c:936
> #7 standard_ExecutorStart (queryDesc=0x273b2d8, eflags=16) at
> execMain.c:263
> #8 0x00007f67c2821d5d in pgss_ExecutorStart (queryDesc=0x273b2d8,
> eflags=<optimized out>) at pg_stat_statements.c:965
> #9 0x00000000007fc226 in PortalStart (portal=portal(at)entry=0x26848b8, params=params(at)entry=0x0, eflags=eflags(at)entry=0, snapshot=snapshot(at)entry=0x0)
> at pquery.c:514
> #10 0x00000000007fa27f in exec_bind_message
> (input_message=0x7ffc9dff90d0) at postgres.c:1995
> #11 PostgresMain (argc=argc(at)entry=1, argv=argv(at)entry=0x7ffc9dff9370,
> dbname=<optimized out>, username=<optimized out>) at postgres.c:4552
> #12 0x000000000077a4ea in BackendRun (port=<optimized out>,
> port=<optimized out>) at postmaster.c:4537
> #13 BackendStartup (port=<optimized out>) at postmaster.c:4259
> #14 ServerLoop () at postmaster.c:1745
> #15 0x000000000077b363 in PostmasterMain (argc=argc(at)entry=5,
> argv=argv(at)entry=0x256abc0) at postmaster.c:1417
> #16 0x00000000004fec63 in main (argc=5, argv=0x256abc0) at main.c:209

This stack is referring to a code path where we are checking that some of the type-related data associated to a record is around, but this does not say exactly where the loop happens, so... Are we looking on a loop in the function execution itself from which the information of pg_stat_ssl is retrieved (aka pg_stat_get_activity())? Or is the type cache somewhat broken because of the extended query protocol?
That's not really possible to see any evidence based on the information provided, though it provides a few hits that can help.
FWIW, I've not heard about an issue like that in the field.

The first thing I would do is update to 14.9, which is the latest version of Postgres available for this major version.
--
Michael

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Thomas Munro 2023-09-07 09:31:00 Re: FW: query pg_stat_ssl hang 100%cpu
Previous Message James Pang (chaolpan) 2023-09-07 08:46:29 RE: FW: query pg_stat_ssl hang 100%cpu

Browse pgsql-performance by date

  From Date Subject
Next Message Thomas Munro 2023-09-07 09:31:00 Re: FW: query pg_stat_ssl hang 100%cpu
Previous Message James Pang (chaolpan) 2023-09-07 08:46:29 RE: FW: query pg_stat_ssl hang 100%cpu