Re: Why is citext/regress failing on hamerkop?

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why is citext/regress failing on hamerkop?
Date: 2024-05-14 09:00:00
Message-ID: 17112876-887c-43aa-e81f-2ec01e3c776f@gmail.com
Lists: pgsql-hackers

13.05.2024 23:17, Tom Lane wrote:
> 3. We don't know exactly why hamerkop suddenly started seeing these
> failures, but a plausible theory emerges after noting that its
> reported time for the successful "make check" step dropped pretty
> substantially right when this started. In the v13 branch, "make
> check" was taking 2:18 or more in the several runs right before the
> first isolationcheck failure, but 1:40 or less just after. So it
> looks like the animal was moved onto faster hardware. That feeds
> into this problem because, with a successful isolationcheck run
> taking more than a minute, there was enough time for some of the
> earlier stuck sessions to time out and exit before their
> postmaster-child slots were needed.

Yes, and one thing I can't explain yet is why the timings on REL_14_STABLE
and later differ substantially from REL_13_STABLE and earlier, say, for the
check stage. On REL_14_STABLE, the oldest available test log (from
2021-10-30) shows check taking 00:03:47 and the newest one (from 2024-05-12)
shows 00:03:18. On REL_13_STABLE, however, the oldest available log (from
2021-08-06) shows a check duration of 00:03:00, which then decreased to
00:02:24 (2021-09-22) and 00:02:14 (2021-11-07), and now it's 1:40, as you said.

Locally I see more or less the same timings on REL_13_STABLE (34, 28, 27
secs) and on REL_14_STABLE (33, 29, 29 secs).

14.05.2024 03:38, Thomas Munro wrote:
> I was beginning to suspect that lingering odour myself. I haven't
> looked at the GSS code but I was imagining that what we have here is
> perhaps not unsent data dropped on the floor due to linger policy
> (unclean socket close on process exit), but rather that the server
> didn't see the socket as ready to read because it lost track of the
> FD_CLOSE somewhere because the client closed it gracefully, and our
> server-side FD_CLOSE handling has always been a bit suspect. I wonder
> if the GSS code is somehow more prone to brokenness. One thing we
> learned in earlier problems was that abortive/error disconnections
> generate FD_CLOSE repeatedly, while graceful ones give you only one.
> In other words, if the other end politely calls closesocket(), the
> server had better not miss the FD_CLOSE event, because it won't come
> again. That's what
>
> https://commitfest.postgresql.org/46/3523/
>
> is intended to fix. Does it help here? Unfortunately that's
> unpleasantly complicated and unbackpatchable (keeping a side-table of
> socket FDs and event handles, so we don't lose events between the
> cracks).

Yes, that cure helps here too. I've tested it on b282fa88d~1 (the last
commit at which that patch set still applies).
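For anyone not following the Windows details: the crux, as I understand it,
is that FD_CLOSE for a gracefully closed socket is reported only once per
event association, so if the event object is torn down and recreated between
waits, the notification can be lost and a later wait hangs. Here is a minimal
sketch of the side-table idea, my own illustration with made-up names (not
the actual patch), assuming the caller has created e->event once per socket
with WSACreateEvent():

#include <winsock2.h>
#include <stdbool.h>

typedef struct SocketEventEntry
{
    SOCKET   sock;
    WSAEVENT event;         /* long-lived event handle for this socket */
    bool     seen_fd_close; /* FD_CLOSE already reported once */
} SocketEventEntry;

/*
 * Wait until the socket is readable or has been closed by the peer.
 * Returns 1 if readable/closed, 0 on timeout, -1 on error.
 */
static int
wait_for_readable(SocketEventEntry *e, DWORD timeout_ms)
{
    WSANETWORKEVENTS ev;

    /* A latched FD_CLOSE means the next recv() will report remaining data or EOF. */
    if (e->seen_fd_close)
        return 1;

    if (WSAEventSelect(e->sock, e->event, FD_READ | FD_CLOSE) == SOCKET_ERROR)
        return -1;

    if (WSAWaitForMultipleEvents(1, &e->event, FALSE, timeout_ms, FALSE) ==
        WSA_WAIT_TIMEOUT)
        return 0;

    if (WSAEnumNetworkEvents(e->sock, e->event, &ev) == SOCKET_ERROR)
        return -1;

    /* Remember FD_CLOSE; Windows won't report a graceful close again. */
    if (ev.lNetworkEvents & FD_CLOSE)
        e->seen_fd_close = true;

    return (ev.lNetworkEvents & (FD_READ | FD_CLOSE)) ? 1 : 0;
}

The point is only the seen_fd_close latch; the real patch keeps such state in
a table keyed by socket so it survives across WaitEventSet lifetimes.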

An excerpt (all lines related to process 12500) from a failed run log
without the patch set:
2024-05-14 05:57:29.526 UTC [8228:128] DEBUG:  forked new backend, pid=12500 socket=5524
2024-05-14 05:57:29.534 UTC [12500:1] [unknown] LOG:  connection received: host=::1 port=51394
2024-05-14 05:57:29.534 UTC [12500:2] [unknown] LOG: !!!BackendInitialize| before ProcessStartupPacket
2024-05-14 05:57:29.534 UTC [12500:3] [unknown] LOG: !!!ProcessStartupPacket| before secure_open_gssapi(), GSSok: G
2024-05-14 05:57:29.534 UTC [12500:4] [unknown] LOG: !!!secure_open_gssapi| before read_or_wait
2024-05-14 05:57:29.534 UTC [12500:5] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 05:57:29.534 UTC [12500:6] [unknown] LOG: !!!read_or_wait| after secure_raw_read: -1, errno: 10035
2024-05-14 05:57:29.534 UTC [12500:7] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 05:57:29.549 UTC [12500:8] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 05:57:29.549 UTC [12500:9] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 05:57:29.549 UTC [12500:10] [unknown] LOG: !!!read_or_wait| after secure_raw_read: 0, errno: 10035
2024-05-14 05:57:29.549 UTC [12500:11] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
...
2024-05-14 05:57:52.024 UTC [8228:3678] DEBUG:  server process (PID 12500) exited with exit code 1
# at the end of the test run

And an excerpt (all lines related to process 11736) from a successful run
log with the patch set applied:
2024-05-14 06:03:57.216 UTC [4524:130] DEBUG:  forked new backend, pid=11736 socket=5540
2024-05-14 06:03:57.226 UTC [11736:1] [unknown] LOG:  connection received: host=::1 port=51914
2024-05-14 06:03:57.226 UTC [11736:2] [unknown] LOG: !!!BackendInitialize| before ProcessStartupPacket
2024-05-14 06:03:57.226 UTC [11736:3] [unknown] LOG: !!!ProcessStartupPacket| before secure_open_gssapi(), GSSok: G
2024-05-14 06:03:57.226 UTC [11736:4] [unknown] LOG: !!!secure_open_gssapi| before read_or_wait
2024-05-14 06:03:57.226 UTC [11736:5] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 06:03:57.226 UTC [11736:6] [unknown] LOG: !!!read_or_wait| after secure_raw_read: -1, errno: 10035
2024-05-14 06:03:57.226 UTC [11736:7] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 06:03:57.240 UTC [11736:8] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 06:03:57.240 UTC [11736:9] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 06:03:57.240 UTC [11736:10] [unknown] LOG: !!!read_or_wait| after secure_raw_read: 0, errno: 10035
2024-05-14 06:03:57.240 UTC [11736:11] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 06:03:57.240 UTC [11736:12] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 06:03:57.240 UTC [11736:13] [unknown] LOG: !!!secure_open_gssapi| read_or_wait returned -1
2024-05-14 06:03:57.240 UTC [11736:14] [unknown] LOG: !!!ProcessStartupPacket| secure_open_gssapi() returned error
2024-05-14 06:03:57.240 UTC [11736:15] [unknown] LOG: !!!BackendInitialize| after ProcessStartupPacket
2024-05-14 06:03:57.240 UTC [11736:16] [unknown] LOG: !!!BackendInitialize| status: -1
2024-05-14 06:03:57.240 UTC [11736:17] [unknown] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:18] [unknown] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:19] [unknown] DEBUG: proc_exit(0): 1 callbacks to make
2024-05-14 06:03:57.240 UTC [11736:20] [unknown] DEBUG:  exit(0)
2024-05-14 06:03:57.240 UTC [11736:21] [unknown] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:22] [unknown] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:23] [unknown] DEBUG: proc_exit(-1): 0 callbacks to make
2024-05-14 06:03:57.243 UTC [4524:132] DEBUG:  forked new backend, pid=10536 socket=5540
2024-05-14 06:03:57.243 UTC [4524:133] DEBUG:  server process (PID 11736) exited with exit code 0
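
(For reference, the "!!!" lines above come from temporary elog(LOG, ...)
tracing I added around the GSSAPI startup path; roughly like the following,
with the exact call sites and variable types approximated:)

/* hypothetical instrumentation around the secure_raw_read() call in
 * read_or_wait(); the format matches the log lines above, the placement
 * is approximate */
elog(LOG, "!!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: %d, len: %d",
     (int) PqGSSRecvLength, (int) len);
ret = secure_raw_read(port, PqGSSRecvBuffer + PqGSSRecvLength,
                      len - PqGSSRecvLength);
elog(LOG, "!!!read_or_wait| after secure_raw_read: %d, errno: %d",
     (int) ret, errno);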

Best regards,
Alexander
