Re: conchuela timeouts since 2021-10-09 system upgrade

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Geoghegan <pg(at)bowt(dot)ie>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: conchuela timeouts since 2021-10-09 system upgrade
Date: 2021-10-27 00:53:37
Message-ID: 186950.1635296017@sss.pgh.pa.us

Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> On Wed, Oct 27, 2021 at 3:29 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> [...] I'd be prepared to believe that prairiedog's
>> ancient macOS version has some weird bug preventing kevent() from noticing
>> available data ... but (a) surely conchuela wouldn't share such a bug,
>> and (b) we've been using kevent() for a couple years now, so how come
>> we didn't see this before?

> There was this case soon after our kqueue support landed:
> https://www.postgresql.org/message-id/CA%2BhUKGLzaR5cV0EmZWoVXJDO_XwZpmpQX_sYwCBRE1qLBEcGPQ%40mail.gmail.com

Oh yeah ... that looks like the exact same thing, doesn't it?

I've just finished doing 500 cycles of amcheck's 002_cic.pl without a
failure after recompiling latch.c with -DWAIT_USE_POLL. With the kevent
code path the failure is infrequent, but not that hard to hit. So unless
we can learn something from the DragonFly case, I'm inclined to write
this off as an ancient-macOS bug and start using -DWAIT_USE_POLL on
prairiedog.
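
(For reference, latch.c chooses its wait primitive with a chain of
#ifdefs near the top of the file. The shape of it, from memory rather
than quoting any particular branch, is roughly

#if defined(WAIT_USE_EPOLL) || defined(WAIT_USE_POLL) || \
    defined(WAIT_USE_KQUEUE) || defined(WAIT_USE_WIN32)
/* respect an explicit choice, e.g. -DWAIT_USE_POLL in CPPFLAGS */
#elif defined(HAVE_SYS_EPOLL_H)
#define WAIT_USE_EPOLL
#elif defined(HAVE_KQUEUE)
#define WAIT_USE_KQUEUE
#elif defined(HAVE_POLL)
#define WAIT_USE_POLL
#elif WIN32
#define WAIT_USE_WIN32
#else
#error "no wait set implementation available"
#endif

so building with -DWAIT_USE_POLL simply overrides the kqueue choice;
no source change is needed.)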

> I don't have any ideas about conchuela. For the next person who
> manages to reproduce this, just to sanity-check what we're passing in
> to kevent(), what do *port and waitfor look like when secure_read()
> blocks in WaitEventSetWait?

Not sure which variables you're referring to there, but here's a more
complete gdb dump from the stuck backend:

(gdb) bt
#0 0x9002ec88 in kevent ()
#1 0x003e3fa8 in WaitEventSetWait (set=0x3802258, timeout=-1, occurred_events=0xbfffe098, nevents=1, wait_event_info=0) at latch.c:1601
#2 0x0028d6a0 in secure_read (port=0x2501f40, ptr=0x7ad1d0, len=8192) at be-secure.c:186
#3 0x002958a4 in pq_recvbuf () at pqcomm.c:957
#4 0x00295994 in pq_getbyte () at pqcomm.c:1000
#5 0x004196a0 in PostgresMain (dbname=0x3805894 "", username=0x789020 "...") at postgres.c:353
#6 0x00363df8 in PostmasterMain (argc=8059652, argv=0x61789299) at postmaster.c:4560
#7 0x00298b6c in main (argc=4, argv=0x2500a30) at main.c:198
(gdb) f 1
#1 0x003e3fa8 in WaitEventSetWait (set=0x3802258, timeout=-1, occurred_events=0xbfffe098, nevents=1, wait_event_info=0) at latch.c:1601
1601 rc = kevent(set->kqueue_fd, NULL, 0,
(gdb) i locals
cur_event = (WaitEvent *) 0x78ed40
cur_kqueue_event = (struct kevent *) 0x0
timeout = {
  tv_sec = 13,
  tv_nsec = 0
}
rc = 4
timeout_p = (struct timespec *) 0x0
rc = 4
returned_events = 4
start_time = {
  tv_sec = -1879041144,
  tv_usec = 58742932
}
cur_time = {
  tv_sec = 59028992,
  tv_usec = 2
}
cur_timeout = -1
(gdb) p *set
$1 = {
  nevents = 3,
  nevents_space = 3,
  events = 0x380227c,
  latch = 0xc954ebc,
  latch_pos = 1,
  exit_on_postmaster_death = 0 '\0',
  kqueue_fd = 4,
  kqueue_ret_events = 0x38022ac,
  report_postmaster_not_running = 0 '\0'
}
(gdb) p set->events[0]
$2 = {
  pos = 0,
  events = 2,
  fd = 7,                # matches MyProcPort->sock
  user_data = 0x0
}
(gdb) p set->events[1]
$3 = {
  pos = 1,
  events = 1,
  fd = -1,
  user_data = 0x0
}
(gdb) p set->events[2]
$4 = {
  pos = 2,
  events = 16,
  fd = 3,
  user_data = 0x0
}
(gdb) p set->kqueue_ret_events[0]
$6 = {
  ident = 7,             # matches MyProcPort->sock
  filter = -1,           # EVFILT_READ
  flags = 1,             # EV_ADD
  fflags = 0,
  data = 40,
  udata = 0x380227c
}

So AFAICS, the data we're passing to kevent() is all sane.
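
(To decode those "events" values against latch.h: the relevant flag
bits, as I recall them -- worth double-checking against the tree being
debugged -- are

#define WL_LATCH_SET        (1 << 0)   /* 1  -> set->events[1], fd = -1, process latch */
#define WL_SOCKET_READABLE  (1 << 1)   /* 2  -> set->events[0], fd = 7, the client socket */
#define WL_POSTMASTER_DEATH (1 << 4)   /* 16 -> set->events[2], fd = 3, postmaster death pipe */

which is just what the backend's FeBeWaitSet ought to contain while
it's waiting for client input.)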

It then occurred to me to replicate the case I saw this morning
(control-C'ing the test script) and ktrace all the involved processes.
The perl test script and the pgbench process both just immediately
die on SIGINT:

25500 pgbench RET select -1 errno 4 Interrupted system call
25500 pgbench PSIG SIGINT SIG_DFL

so they are clearly not doing anything to cause the DROP command
to get delivered. Nonetheless, what the backend sees post-control-C is

25502 postgres RET kevent 1
25502 postgres CALL recvfrom(0x7,0x7ad1d0,0x2000,0,0,0)
25502 postgres GIO fd 7 wrote 34 bytes
"Q\0\0\0!DROP INDEX CONCURRENTLY idx;\0"
25502 postgres RET recvfrom 34/0x22
... blah, blah, executing the command ...
25502 postgres CALL sendto(0x7,0x241d020,0x16,0,0,0)
25502 postgres RET sendto -1 errno 32 Broken pipe
... blah, blah, shutting down ...

It's really hard to look at this and not call it a kernel bug.
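
If anyone wants to poke at the kernel side of this without dragging a
whole Postgres instance into it, the pattern the backend relies on
boils down to a pretty small standalone test: register EVFILT_READ on
one end of a socketpair, block in kevent() with no timeout, and have
another process write to the other end. Here's an untested sketch
(names, error handling, and the sleep are mine, not lifted from
latch.c):

#include <sys/types.h>
#include <sys/event.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <err.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int         sv[2];
    int         kq;
    int         rc;
    pid_t       child;
    struct kevent kev;
    struct kevent ret;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        err(1, "socketpair");

    child = fork();
    if (child < 0)
        err(1, "fork");
    if (child == 0)
    {
        /* child: give the parent time to block, then send one byte */
        close(sv[0]);
        sleep(2);
        if (write(sv[1], "Q", 1) != 1)
            err(1, "write");
        _exit(0);
    }

    close(sv[1]);

    kq = kqueue();
    if (kq < 0)
        err(1, "kqueue");

    /* register interest in readability of our end of the socketpair */
    EV_SET(&kev, sv[0], EVFILT_READ, EV_ADD, 0, 0, NULL);
    if (kevent(kq, &kev, 1, NULL, 0, NULL) < 0)
        err(1, "kevent(register)");

    /* block with no timeout, as WaitEventSetWait does for timeout = -1 */
    rc = kevent(kq, NULL, 0, &ret, 1, NULL);
    if (rc < 0)
        err(1, "kevent(wait)");

    printf("kevent returned %d, ident=%lu, data=%ld\n",
           rc, (unsigned long) ret.ident, (long) ret.data);
    return 0;
}

Of course, a toy like that is unlikely to trip over a race that only
shows up once in many test cycles, but it's something to hang ktrace
on while watching kevent()'s behavior on these platforms.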

regards, tom lane
