From: | "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Fabrice Chapuis <fabrice636861(at)gmail(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | RE: Logical replication timeout problem |
Date: | 2022-01-21 09:51:27 |
Message-ID: | OS3PR01MB62751F9FC898C73E638E36489E5B9@OS3PR01MB6275.jpnprd01.prod.outlook.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Jan 20, 2022 at 9:18 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> It might be not reaching the actual send_keep_alive logic in
> WalSndKeepaliveIfNecessary because of below code:
> {
> ...
> /*
> * Don't send keepalive messages if timeouts are globally disabled or
> * we're doing something not partaking in timeouts.
> */
> if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0) return; ..
> }
>
> I think you can add elog before the above return and before updating progress
> in the below code:
> case REORDER_BUFFER_CHANGE_INSERT:
> if (!relentry->pubactions.pubinsert)
> + {
> + OutputPluginUpdateProgress(ctx);
> return;
>
> This will help us to rule out one possibility.
Thanks for your advices!
According to your advices, I applied 0001,0002 and 0003 to run the test script.
When subscriber timeout, I filter publisher-side log:
$ grep "before invoking update progress" pub.log | wc -l
60373557
$ grep "return because wal_sender_timeout or last_reply_timestamp" pub.log | wc -l
0
$ grep "return because waiting_for_ping_response" pub.log | wc -l
0
Based on this result, I think function WalSndKeepaliveIfNecessary was invoked,
but function WalSndKeepalive was not invoked because (last_processing >=
ping_time) is false.
So I tried to see changes about last_processing and last_reply_timestamp
(because ping_time is based on last_reply_timestamp).
I found last_processing and last_reply_timestamp is set in function
ProcessRepliesIfAny.
last_processing is set to the time when function ProcessRepliesIfAny is
invoked.
Only when publisher receive a response from subscriber, last_reply_timestamp is
set to last_processing and the flag waiting_for_ping_response is reset to
false.
When we are during the loop to skip all the changes of transaction, IIUC, we do
not invoke function ProcessRepliesIfAny. So I think last_processing and
last_reply_timestamp will not be changed in this loop.
Therefore I think about our use case, we should modify the condition of
invoking WalSndKeepalive.(please refer to
0004-Simple-modification-of-timing.patch, and note that this is only a patch
for testing).
At the same time I modify the input of WalSndKeepalive from true to false. This
is because when input is true, waiting_for_ping_response is set to true in
WalSndKeepalive. As mentioned above, ProcessRepliesIfAny is not invoked in the
loop, so I think waiting_for_ping_response will not be reset to false and
keepalive messages will not be sent.
I tested after applying patches(0001 and 0004), I found the timeout was not
printed in subscriber-side log. And the added messages "begin load changes" and
"commit the log" were printed in publisher-side log:
$ grep -ir "begin load changes" pub.log
2022-01-21 11:17:06.934 CST [2577699] LOG: begin load changes
$ grep -ir "commit the log" pub.log
2022-01-21 11:21:15.564 CST [2577699] LOG: commit the log
Attach the patches and test script mentioned above, in case someone wants to
try.
Regards,
Wang wei
Attachment | Content-Type | Size |
---|---|---|
0001-Send-keepalive.patch | application/octet-stream | 1.7 KB |
0002-Add-some-logs-to-debug.patch | application/octet-stream | 1.3 KB |
0003-Add-more-logs.patch | application/octet-stream | 1.5 KB |
0004-Simple-modification-of-timing.patch | application/octet-stream | 2.2 KB |
test.sh | application/octet-stream | 1.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Dilip Kumar | 2022-01-21 10:27:02 | Re: row filtering for logical replication |
Previous Message | Peter Eisentraut | 2022-01-21 09:44:24 | Re: ICU for global collation |