Re: shared-memory based stats collector

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: gkokolatos(at)protonmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: shared-memory based stats collector
Date: 2021-03-10 06:20:43
Message-ID: b36a7e0f-f4c7-51f5-497c-02e14213553e@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2021/03/10 12:10, Kyotaro Horiguchi wrote:
> At Tue, 9 Mar 2021 23:24:10 +0900, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote in
>>
>>
>> On 2021/03/09 16:51, Kyotaro Horiguchi wrote:
>>> At Sat, 6 Mar 2021 00:32:07 +0900, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
>>> wrote in
>>>> I don't think that we should treat non-zero exit condition as a crash,
>>>> as before. Otherwise when archive_command fails on a signal,
>>>> archiver emits FATAL error and which leads the server restart.
>>> Sounds reasonable. Now archiver is treated the same way to wal
>>> receiver. Specifically exit(1) doesn't cause server restart.
>>
>> Thanks!
>>
>> - if (PgArchStartupAllowed())
>> - PgArchPID = pgarch_start();
>>
>> In the latest patch, why did you remove the code to restart new archiver
>> in reaper()? When archiver dies, I think new archiver should be restarted like
>> the current reaper() does. Otherwise, the restart of archiver can be
>> delayed until the next cycle of ServerLoop, which may take time.
>
> Agreed. The code moved to the original place and added the crash
> handling code. And I added a phrase to the comment.
>
> + * Was it the archiver? If exit status is zero (normal) or one (FATAL
> + * exit), we assume everything is all right just like normal backends
> + * and just try to restart a new one so that we immediately retry
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> + * archiving of remaining files. (If fail, we'll try again in future
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

"of" of "archiving of remaining" should be replaced with "the", or removed?

Just for record. Previously LogChildExit() was called and the following LOG
message was output when the archiver reported FATAL error. OTOH the patch
prevents that and the following LOG message is not output at FATAL exit of
archiver. But I don't think that the following message is required in that case
because FATAL message indicating the similar thing is already output.
Therefore, I'm ok with the patch.

LOG: archiver process (PID 46418) exited with exit code 1

>> I read v50_003 patch.
>>
>> When archiver dies, ProcGlobal->archiverLatch should be reset to NULL,
>> like walreceiver does the similar thing in WalRcvDie()?
>
> Differently from walwriter and checkpointer, archiver as well as
> walreceiver may die while server is running. Leaving the latch pointer
> alone may lead to nudging a wrong process that took over the same
> procarray slot. Added pgarch_die() to do that.

Thanks!

+ if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+ SetLatch(ProcGlobal->archiverLatch);

The latch can be reset to NULL in pgarch_die() between the if-condition and
SetLatch(), and which would be problematic. Probably we should protect
the access to the latch by using spinlock, like we do for walreceiver's latch?

>
> (I moved the archiverLatch to just after checkpointerLatch in this version.)
>
>> In pgarch.c, #include "postmaster/fork_process.h" seems no longer necessary.
>
> Right. That's not due to this patch, postmater.h, dsm.h and pg_shmem.h
> are not used. (fd.h is not necessary but pgarch.c uses AllocateDir().)
>
>> + if (strcmp(argv[1], "--forkarch") == 0)
>> + {
>>
>> Why is this necessary? I was thinking that "--forkboot" handles archiver
>> in SubPostmasterMain().
>
> Yeah, the correspondent code is removed in the same patch at the same
> time.
>
> The attached is v51 patchset.

Thanks a lot!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2021-03-10 06:40:38 Re: Occasional tablespace.sql failures in check-world -jnn
Previous Message Michael Paquier 2021-03-10 05:55:05 pgsql: Move tablespace path re-creation from the makefiles to pg_regres