RE: BUG #17421: Core dump in ECPGdo() when calling PostgreSQL API from 32-bit client for RHEL8

From: "hirose(dot)masay-01(at)fujitsu(dot)com" <hirose(dot)masay-01(at)fujitsu(dot)com>
To: 'Tom Lane' <tgl(at)sss(dot)pgh(dot)pa(dot)us>, 'Michael Paquier' <michael(at)paquier(dot)xyz>
Cc: "'pgsql-bugs(at)lists(dot)postgresql(dot)org'" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: RE: BUG #17421: Core dump in ECPGdo() when calling PostgreSQL API from 32-bit client for RHEL8
Date: 2022-06-06 09:50:16
Message-ID: TYYPR01MB68109C3C722F18E2D070F350B8A29@TYYPR01MB6810.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hello,
I could not find out the root cause of this case. Instead I propose a workaround fix for ecpglib not to generate the core file in snprintf().
I expect the fix would avoid most of irregular cases. Could you review the fix and include it to the latest sources both of Postgres12.x and 14.x?
I explain the problem and the fix below:
[Problem]
The potential problem is no value is set to "message" after PQresultErrorField() is called in ecpg_raise_backend().
src\interfaces\ecpg\ecpglib\error.c
----------------------------
if (result)
{
sqlstate = PQresultErrorField(result, PG_DIAG_SQLSTATE);
if (sqlstate == NULL)
sqlstate = ECPG_SQLSTATE_ECPG_INTERNAL_ERROR;
message = PQresultErrorField(result, PG_DIAG_MESSAGE_PRIMARY);
}
----------------------------

I take another function as an example, ECPGnoticeReceiver() in connect.c, which has similar code.
The value "empty message text" is set to "message" just in case. I propose to add the same code to ecpg_raise_backend() as ECPGnoticeReceiver().
src\interfaces\ecpg\ecpglib\connect.c
----------------------------
static void
ECPGnoticeReceiver(void *arg, const PGresult *result)
{
char *sqlstate = PQresultErrorField(result, PG_DIAG_SQLSTATE);
char *message = PQresultErrorField(result, PG_DIAG_MESSAGE_PRIMARY);

if (sqlstate == NULL)
sqlstate = ECPG_SQLSTATE_ECPG_INTERNAL_ERROR;

if (message == NULL) /* Shouldn't happen, but need to be sure */
message = ecpg_gettext("empty message text");
----------------------------

[Fix]
The fix is to add 2 lines with changebar below in ecpg_raise_backend():
src\interfaces\ecpg\ecpglib\error.c (version12.1 line:237)
----------------------------
if (result)
{
sqlstate = PQresultErrorField(result, PG_DIAG_SQLSTATE);
if (sqlstate == NULL)
sqlstate = ECPG_SQLSTATE_ECPG_INTERNAL_ERROR;
message = PQresultErrorField(result, PG_DIAG_MESSAGE_PRIMARY);
| if (message == NULL)
| message = ecpg_gettext("empty message text");
}
----------------------------

[Test result]
With the fix, I confirmed the snprintf() did not generate the core and returned the sqlstate "YE000" as expected.

Regards,
Masa

-----Original Message-----
From: Hirose, Masayuki/廣世 政幸
Sent: Thursday, April 7, 2022 2:31 AM
To: 'Tom Lane' <tgl(at)sss(dot)pgh(dot)pa(dot)us>; 'Michael Paquier' <michael(at)paquier(dot)xyz>
Cc: 'pgsql-bugs(at)lists(dot)postgresql(dot)org' <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: RE: BUG #17421: Core dump in ECPGdo() when calling PostgreSQL API from 32-bit client for RHEL8

Hi,
Sorry for the late response. I report my test result using the client 12.10, formerly 12.10, and insight with the code trace.
I tested the same scenario to reproduce the issue using the postgreSQL client 12.10.
The issue was reproduced same as the previous version 12.1.
The frequency is as follows :
Client 12.1 : DB server down 30 times -> Core dump 7 times Client 12.10: DB server down 30 times -> Core dump 6 times

I cannot tell the detailed condition to reproduce the issue, however it looks that the issue occurs when the client issues SQL commit and simultaneously the Database server goes down.
Next, I tried to find out where the core is generated in ecpg_raise_backend code with debug log inserted.
In case the core is NOT generated - Case(a), ecpg_raise_backend() processed error information(sqlca) normally. In this case, my application retried the transaction.
In case the core is generated - Case(b), ecpg_raise_backend() did not process error information(sqlca) correctly but generated the core file.
The difference between Case(a) and Case(b) is as follows:
In case (a), the client detects communication error between Database server 20 seconds after it issued SQL commit. When ecpg_raise_backend is called, "status" in PGconn context was set as "CONNECTION_BAD(1)"
In case (b) of core dumped, the client detects communication error between Database server 2 minutes after it issued SQL commit. When ecpg_raise_backend is called, "status" in PGconn context was set as "CONNECTION_OK(0)".
Please check the following log message and the code trace in ecpg_raise_backend().
When "status" in the PGconn context is "CONNECTION_OK(0)" - Case(b), the core dump could occur in snprintf of ecpg_raise_backend(). In this scenario, once the "message" is NULL and the value is set unchanged until the snprintf is called.

Please note that Japanese log messages are replaced to "xxx"

Case(a) normal
--------------
Client log:
...
20:20:18 [...]: ecpg_execute on line 941: using PQexec
20:20:18 [...]: ecpg_process_output on line 941: OK: INSERT 0 1
20:20:18 [...]: ECPGtrans on line 716: action "commit work"; connection "0"
20:20:39 [...]: ECPGnoticeReceiver: xxx (18563)
20:20:39 [...]: raising sqlcode 0
20:20:39 [...]: ecpg_check_PQresult on line 941: bad response - server closed the connection unexpectedly
20:20:39 This probably means the server terminated abnormally
20:20:39 before or while processing the request.
20:20:39 [...]: raise_backend start: conn - server closed the connection unexpectedly
20:20:39 This probably means the server terminated abnormally
20:20:39 before or while processing the request.
20:20:39 [...]: raise_backend start: conn - 1
20:20:39 [...]: raise_backend: result not NULL
20:20:39 [...]: raise_backend: sqlstate NULL
20:20:39 [...]: raise_backend: message NULL
20:20:39 [...]: raise_backend: sqlstate INTERNAL_ERROR
20:20:39 [...]: ecpg_raise_backend sqlstate: 57P02 messase: the connection to the server was lost errno: 0
20:20:39 [...]: raising sqlstate 57P02 (sqlcode -400): the connection to the server was lost on line 941
20:20:39 [...]: ECPGtrans on line 768: action "rollback work"; connection "0"
20:20:39 [...]: ecpg_check_PQresult on line 768: no result - no connection to the server
20:20:39 [...]: raise_backend start: conn - no connection to the server
20:20:39 [...]: raise_backend start: conn - 1
20:20:39 [...]: raise_backend: result NULL
20:20:39 [...]: raise_backend: sqlstate INTERNAL_ERROR
20:20:39 [...]: ecpg_raise_backend sqlstate: 57P02 messase: the connection to the server was lost errno: 0
20:20:39 [...]: raising sqlstate 57P02 (sqlcode -400): the connection to the server was lost on line 768
20:20:39 [...]: ecpg_finish: connection 0 closed

DB server log:
...
20:20:31.227 ... PostgreSQL JDBC Driver) WARNING: 57P02: xxx (...) ...

Case(b) core dump
--------------

Client log:
...
18:38:17 [...]: ecpg_execute on line 561: using PQexec
18:38:17 [...]: ecpg_process_output on line 561: OK: CLOSE CURSOR
18:38:17 [...]: deallocate_one on line 562: name stmid
18:38:17 [...]: ECPGtrans on line 716: action "commit work"; connection "0"
18:40:15 [...]: ECPGnoticeReceiver: xxx (...)
18:40:15 [...]: raising sqlcode 0
18:40:15 [...]: ecpg_check_PQresult on line 941: bad response - server closed the connection unexpectedly
18:40:15 This probably means the server terminated abnormally
18:40:15 before or while processing the request.
18:40:15 [...]: raise_backend start: conn - server closed the connection unexpectedly
18:40:15 This probably means the server terminated abnormally
18:40:15 before or while processing the request.
18:40:15 [...]: raise_backend start: conn - 0
18:40:15 [...]: raise_backend: result not NULL
18:40:15 [...]: raise_backend: sqlstate NULL
18:40:15 [...]: raise_backend: message NULL
18:40:15 [...]: raise_backend: sqlstate INTERNAL_ERROR
18:40:16 // No subsequent log messages due to the core dump //

DB server log:
...
18:38:17.952 ... PostgreSQL JDBC Driver) WARNING: 57P02: xxx (...) ...

The ecpg_raise_backend code used with debug log:

void
ecpg_raise_backend(int line, PGresult *result, PGconn *conn, int compat) {
struct sqlca_t *sqlca = ECPGget_sqlca();
char *sqlstate;
char *message;

/* debug */
ecpg_log("raise_backend start: conn - %s", PQerrorMessage(conn));
ecpg_log("raise_backend start: conn - %d\n", PQstatus(conn));

...

if (result)
{
sqlstate = PQresultErrorField(result, PG_DIAG_SQLSTATE);
/* debug */
ecpg_log("raise_backend: result not NULL\n");
/* debug end */
if (sqlstate == NULL)
{ /* debug */
ecpg_log("raise_backend: sqlstate NULL\n");
/* debug end */
sqlstate = ECPG_SQLSTATE_ECPG_INTERNAL_ERROR;
} /* debug */
message = PQresultErrorField(result, PG_DIAG_MESSAGE_PRIMARY);
/* debug */
if (message == NULL)
{
ecpg_log("raise_backend: message NULL\n");
}
/* debug end */
}
else
{
/* debug */
ecpg_log("raise_backend: result NULL\n");
/* debug end */
sqlstate = ECPG_SQLSTATE_ECPG_INTERNAL_ERROR;
message = PQerrorMessage(conn);
}

if (strcmp(sqlstate, ECPG_SQLSTATE_ECPG_INTERNAL_ERROR) == 0)
{
/*
* we might get here if the connection breaks down, so let's check for
* this instead of giving just the generic internal error
*/
/* debug */
ecpg_log("raise_backend: sqlstate INTERNAL_ERROR\n");
/* debug end */
if (PQstatus(conn) == CONNECTION_BAD)
{
sqlstate = "57P02";
message = ecpg_gettext("the connection to the server was lost");
}
}

/* Debug start */
ecpg_log("ecpg_raise_backend sqlstate: %s messase: %s errno: %d\n", sqlstate, message, errno);
/* Debug end */

/* copy error message */
snprintf(sqlca->sqlerrm.sqlerrmc, sizeof(sqlca->sqlerrm.sqlerrmc), "%s on line %d", message, line);
sqlca->sqlerrm.sqlerrml = strlen(sqlca->sqlerrm.sqlerrmc);

...

Regards,
Masa

-----Original Message-----
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Sent: Sunday, March 6, 2022 12:19 AM
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Hirose, Masayuki/廣世 政幸 <hirose(dot)masay-01(at)fujitsu(dot)com>; pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17421: Core dump in ECPGdo() when calling PostgreSQL API from 32-bit client for RHEL8

Michael Paquier <michael(at)paquier(dot)xyz> writes:
> On Sat, Mar 05, 2022 at 06:45:21AM +0000, hirose(dot)masay-01(at)fujitsu(dot)com wrote:
>> My test case to reproduce the issue is:
>> 1. The client connects Postgres Database and issues SQL continuously.
>> 2. Switch the Database role from Active to Standby.
>> The Database is mirrored by the Mirroring Controller between two
>> clustered servers. the Mirroring Controller may be the original
>> feature added by the enterprise.

> A self-contained test case enters in the category of an ECPG script
> that we could use to reproduce the problem. Personally, I have no
> idea what kind of application stack you are using, and I don't know
> TJVvDatabaseAPI, which I suspect is a proprietary solution for
> something related to databases. The information you are providing
> here is not enough for one to know how to reproduce this problem.

"Switch from active to standby" isn't even possible in community Postgres, so there are definitely moving parts in this recipe that we are not responsible for or familiar with. Perhaps the problem can be reproduced with just stock Postgres, but nobody here is going to expend the effort to try to build a reproducer from this amount of information.

We have a wiki page offering advice about creating actionable problem reports:

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Robert Haas 2022-06-06 14:30:07 Re: BUG #17504: psql --single-transaction -vON_ERROR_STOP=1 still commits after client-side error
Previous Message PG Bug reporting form 2022-06-06 06:51:36 BUG #17510: cache lookup failed for type