Re: Restart pg_usleep when interrupted

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: "Imseih (AWS), Sami" <samimseih(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Restart pg_usleep when interrupted
Date: 2024-08-17 23:12:22
Message-ID: CA+hUKG+f-nEc_SowDLW1JMUa6Of5sCK-JZ=v-KhL1xgXk83fiw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 14, 2024 at 9:30 AM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
> Another concern is the huge number of PqMsg_Progress messages sent by
> parallel workers with that approach. In Bertrand's tests, he was seeing
> nearly 350K interrupts for a ~19 minute vacuum (~300 interrupts per
> second). That seems a bit extreme to me. I don't see how anyone could
> possibly need stats about vacuum delays with that level of accuracy.

I suspect CF #5118 would fix lots of cases of SendProcSignal() callers
going berserk, because it deletes SendProcSignal() and introduces
SendInterrupt(), which calls SetLatch(), which doesn't send a signal if
the latch is already set. Even if the latch is not already set, it
only sends a signal if the latch is currently being waited on
("maybe_sleeping" flag). Even when it sends a signal, it goes to a
signalfd, kqueue or NT event flag on common platforms.

Of course that is only talking about the receiving side. I'm sure we
can improve the senders too. There's nothing we can do about NOTIFY,
because that's under user control, but that PqMsg_Progress case sounds
pretty bad, and the recovery conflict system could probably be made
more precise in its logic about who to wake up and when, etc.

Other backends going bananas with SendProcSignal() is the reason
dsm_impl_posix_resize() has to block signals while calling
posix_fallocate(). Unlike nanosleep(), which you can fix by tracking
remaining time, posix_fallocate() is all-or-nothing: it has no way to
report partial progress, so it must undo its work if interrupted. Its
EINTR retry loop could therefore get stuck forever when other backends
are trigger-happy with signals, which was a real production issue. I
guess both of these issues go away in practice if CF #5118 goes in.
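
The contrast between the two calls can be sketched roughly as follows. Both function names (sleep_fully, fallocate_unint) are hypothetical, and this is an illustration under plain POSIX assumptions rather than the code in pg_usleep() or dsm_impl_posix_resize(). The first function resumes an interrupted sleep from the remainder that nanosleep() reports; the second has no remainder to resume from, so it blocks signals for the duration of the call instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <time.h>

/*
 * Sleep for the whole requested time, restarting with the unslept
 * remainder after each EINTR. Returns 0 on success, -1 on other errors.
 */
static int
sleep_fully(long usec)
{
    struct timespec req;
    struct timespec rem;

    assert(usec >= 0);
    req.tv_sec = usec / 1000000L;
    req.tv_nsec = (usec % 1000000L) * 1000L;

    while (nanosleep(&req, &rem) != 0)
    {
        if (errno != EINTR)
            return -1;
        req = rem;          /* partial progress is reported, so resume */
    }
    return 0;
}

/*
 * posix_fallocate() cannot report partial progress, so instead of hoping
 * an EINTR retry loop eventually wins, block all signals around the call.
 * Returns 0 on success or an error number (posix_fallocate() returns the
 * error directly rather than setting errno).
 */
static int
fallocate_unint(int fd, off_t size)
{
    sigset_t all;
    sigset_t saved;
    int      rc;

    sigfillset(&all);
    if (sigprocmask(SIG_BLOCK, &all, &saved) != 0)
        return errno;

    do
    {
        rc = posix_fallocate(fd, 0, size);
    } while (rc == EINTR);  /* should not repeat while signals are blocked */

    sigprocmask(SIG_SETMASK, &saved, NULL);
    return rc;
}
```

Note that restarting from the reported remainder can still oversleep a little on each interruption, so under a heavy signal load the total sleep time may drift upward.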
