Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Date: 2022-01-12 07:00:00
Message-ID: d6633d6c-e547-52d1-0798-a7848583b501@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


> 11.01.2022 23:16, Tom Lane wrote:
>> Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
>>> Ouch. I think our options at this point are:
>>> 1. Revert 6051857fc (and put it back when we have a working
>>> long-lived WES as I showed). This is not very satisfying, now that we
>>> understand the bug, because even without that change I guess you must
>>> be able to reach the hanging condition by using Windows postgres_fdw
>>> to talk to a non-Windows server (ie a normal TCP stack with graceful
>>> shutdown/linger on process exit).
>> It'd be worth checking, perhaps. One thing I've been wondering all
>> along is how much of this behavior is specific to the local-loopback
>> case where Windows can see both ends of the connection. You'd think
>> that they couldn't long get away with such blatant violations of the
>> TCP specs when talking to external servers, because the failures
>> would be visible to everyone with a web browser.
I've split my test (both parts attached) and run it on two virtual
machines with clean builds from master (ac7c8075) on both (just the
debugging output added to connection.c). I provide probably redundant
info (also see attached screenshot) just to make sure that I didn't make
a mistake.
The excerpt from 001_disconnection1_local.log:
...
2022-01-12 09:29:48.099 MSK|Administrator|postgres|61de755a.a54|LOG: 
pgfdw_get_result: before WaitLatchOrSocket
2022-01-12 09:29:48.099
MSK|Administrator|postgres|61de755a.a54|STATEMENT:  SELECT * FROM large
WHERE a = fx2(a)
2022-01-12 09:29:48.100 MSK|Administrator|postgres|61de755a.a54|LOG: 
pgfdw_get_result: after WaitLatchOrSocket
2022-01-12 09:29:48.100
MSK|Administrator|postgres|61de755a.a54|STATEMENT:  SELECT * FROM large
WHERE a = fx2(a)
2022-01-12 09:29:48.100 MSK|Administrator|postgres|61de755a.a54|LOG: 
pgfdw_get_result: before WaitLatchOrSocket
2022-01-12 09:29:48.100
MSK|Administrator|postgres|61de755a.a54|STATEMENT:  SELECT * FROM large
WHERE a = fx2(a)
2022-01-12 09:29:48.100 MSK|Administrator|postgres|61de755a.a54|LOG: 
pgfdw_get_result: after WaitLatchOrSocket
2022-01-12 09:29:48.100
MSK|Administrator|postgres|61de755a.a54|STATEMENT:  SELECT * FROM large
WHERE a = fx2(a)
2022-01-12 09:29:48.100 MSK|Administrator|postgres|61de755a.a54|ERROR: 
FATAL:  terminating connection due to administrator command
    server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
2022-01-12 09:29:48.100
MSK|Administrator|postgres|61de755a.a54|CONTEXT:  remote SQL command:
FETCH 100 FROM c1
2022-01-12 09:29:48.100
MSK|Administrator|postgres|61de755a.a54|WARNING:  no connection to the
server
2022-01-12 09:29:48.100
MSK|Administrator|postgres|61de755a.a54|CONTEXT:  remote SQL command:
ABORT TRANSACTION
2022-01-12 09:29:48.107 MSK|Administrator|postgres|61de755a.a54|LOG: 
disconnection: session time: 0:00:01.577 user=Administrator
database=postgres host=127.0.0.1 port=49752
2022-01-12 09:29:48.257 MSK|[unknown]|[unknown]|61de755c.a4c|LOG: 
connection received: host=127.0.0.1 port=49754
2022-01-12 09:29:48.261 MSK|Administrator|postgres|61de755c.a4c|LOG: 
connection authenticated: identity="WIN-FCPSOVMM1JC\Administrator"
method=sspi
(C:/src/postgrespro/contrib/postgres_fdw/tmp_check/t_001_disconnection1_local_data/pgdata/pg_hba.conf:2)
2022-01-12 09:29:48.261 MSK|Administrator|postgres|61de755c.a4c|LOG: 
connection authorized: user=Administrator database=postgres
application_name=001_disconnection1.pl
2022-01-12 09:29:48.263 MSK|Administrator|postgres|61de755c.a4c|LOG: 
statement: SELECT * FROM large WHERE a = fx2(a)
2022-01-12 09:29:48.285 MSK|Administrator|postgres|61de755c.a4c|LOG: 
pgfdw_get_result: before WaitLatchOrSocket
2022-01-12 09:29:48.285
MSK|Administrator|postgres|61de755c.a4c|STATEMENT:  SELECT * FROM large
WHERE a = fx2(a)
...

By the look of things, you are right and this is the localhost-only issue.
I've rechecked that the test 001_disconnection.pl (local-loopback
version) hangs on both machines while 001_disconnection1.pl performs
successfully in both directions. I'm not sure whether the Windows client
and non-Windows server or reverse combinations are of interest in light
of the above.

Best regards,
Alexander

Attachment Content-Type Size
001_disconnection1.pl application/x-perl 1.4 KB
001_restart.pl application/x-perl 1.0 KB
Screenshot_20220112_093004.png image/png 192.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2022-01-12 07:00:51 Re: Proposal: allow database-specific role memberships
Previous Message Tom Lane 2022-01-12 06:58:40 Re: cfbot wrangling (was Re: Add checkpoint and redo LSN to LogCheckpointEnd log message)