Re: when the startup process doesn't (logging startup delays)

From: Nitin Jadhav <nitinjadhavpostgres(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: when the startup process doesn't (logging startup delays)
Date: 2021-09-22 14:29:17
Message-ID: CAMm1aWZqQNdjawULasOPr0Kt0JaweRoDF2W1353O7VUraxisVA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> Michael is right. You updated some of the units based on Robert's suggestion
> to use MS, but didn't update all of the corresponding parts of the patch.
> guc.c says that the units are in MS, which means that unqualified values are
> interpretted as such. But postgresql.conf.sample still says "seconds", and
> guc.c says the default value is "10", and you still do:
>
> + interval_in_ms = log_startup_progress_interval * 1000;
>
> I checked that this currently does not interpret the value as ms:
> |./tmp_install/usr/local/pgsql/bin/postgres -D src/test/regress/tmp_check/data/ -c log_startup_progress_interval=1
> |2021-09-07 06:28:58.694 CDT startup[18865] LOG: redo in progress, elapsed time: 1.00 s, current LSN: 0/E94ED88
> |2021-09-07 06:28:59.694 CDT startup[18865] LOG: redo in progress, elapsed time: 2.00 s, current LSN: 0/10808EE0
> |2021-09-07 06:29:00.694 CDT startup[18865] LOG: redo in progress, elapsed time: 3.00 s, current LSN: 0/126B8C80
>
> (Also, the GUC value is in the range 0..INT_MAX, so multiplying and storing to
> another int could overflow.)
>
> I think the convention is to for GUC global vars to be initialized with the
> same default as in guc.c, so both should be 10000, like:
>
> +int log_startup_progress_interval = 10 * 1000 /* 10sec */

Thanks Justin for the detailed explanation. Done the necessary changes.

Please find the updated patch attached.

On Mon, Sep 13, 2021 at 8:32 PM Nitin Jadhav
<nitinjadhavpostgres(at)gmail(dot)com> wrote:
>
> > I think you're confusing discussions.
> >
> > Robert was talking about initdb/bootstrap/single, and I separately and
> > independently asked about hot standbys. It seems like Robert and I agreed
> > about the desired behavior and there was no further discussion.
>
> Sorry for including 2 separate points into one.
>
> > Looking over this version, I realized something I (or you) should have
> > noticed sooner: you've added the RegisterTimeout call to
> > InitPostgres(), but that's for things that are used by all backends,
> > and this is only used by the startup process. So it seems to me that
> > the right place is StartupProcessMain. That would have the further
> > advantage of allowing startup_progress_timeout_handler to be made
> > static. begin_startup_progress_phase() and
> > startup_progress_timeout_has_expired() are the actual API functions
> > though so they will need to remain extern.
>
> The earlier discussion wrt this point is as follows.
>
> > > I also agree that this is the better place to do it. Hence modified
> > > the patch accordingly. The condition "!AmStartupProcess()" is added to
> > > differentiate whether the call is done from a startup process or some
> > > other process. Actually StartupXLOG() gets called in 2 places. one in
> > > StartupProcessMain() and the other in InitPostgres(). As the logging
> > > of the startup progress is required only during the startup process
> > > and not in the other cases,
> >
> > The InitPostgres() case occurs when the server is started in bootstrap
> > mode (during initdb) or in single-user mode (postgres --single). I do
> > not see any reason why we shouldn't produce progress messages in at
> > least the latter case. I suspect that someone who is in the rather
> > desperate scenario of having to use single-user mode would really like
> > to know how long the server is going to take to start up.
> >
> > Perhaps during initdb we don't want messages, but I'm not sure that we
> > need to do anything about that here. None of the messages that the
> > server normally produces show up when you run initdb, so I guess they
> > are getting redirected to /dev/null or something.
> >
> > So I don't think that using AmStartupProcess() for this purpose is right.
>
> This point is really confusing. As per the earlier discussion we
> concluded to include RegisterTimeout() call even in case of
> InitPostgres() to support logging in case of single-user mode. Now if
> we remove the RegisterTimeout() call from InitPostgres(), we are not
> going to support that anymore. Is this what you're trying to convey?
> or we should add some checks and disable the code to RegisterTimeout()
> if it is other than single-user mode. I have added a check if
> (!IsPostmasterEnvironment) in the attached patch for this scenario.
> Kindly confirm my understanding.
>
> > > Should this feature distinguish between crash recovery and archive recovery on
> > > a hot standby ? Otherwise the standby will display this all the time.
> > >
> > >2021-08-14 16:13:33.139 CDT startup[11741] LOG: redo in progress, elapsed time: 124.42 s, current LSN: 0/EEE2100
> > >
> > >If so, I think maybe you'd check !InArchiveRecovery (but until Robert finishes
> > > cleanup of xlog.c variables, I can't say that with much confidence).
> >
> > Hmm. My inclination is to think that on an actual standby, you
> > wouldn't want to get messages like this, but if you were doing a
> > point-in-time-recovery, then you would. So I think maybe
> > InArchiveRecovery is not the right thing. Perhaps StandbyMode?
>
> I also feel that the log messages should be recorded in case of
> point-in-time-recovery. So I have added a check if (!StandbyMode) and
> verified the replication and point-in-time-recovery scenario.
>
> > > Some doc comments:
> >
> > Thanks for the suggestions. I will take care in the next patch.
>
> Fixed.
>
> > Michael is right. You updated some of the units based on Robert's suggestion
> > to use MS, but didn't update all of the corresponding parts of the patch.
> > guc.c says that the units are in MS, which means that unqualified values are
> > interpretted as such. But postgresql.conf.sample still says "seconds", and
> > guc.c says the default value is "10", and you still do:
> >
> > + interval_in_ms = log_startup_progress_interval * 1000;
> >
> > I checked that this currently does not interpret the value as ms:
> > |./tmp_install/usr/local/pgsql/bin/postgres -D src/test/regress/tmp_check/data/ -c log_startup_progress_interval=1
> > |2021-09-07 06:28:58.694 CDT startup[18865] LOG: redo in progress, elapsed time: 1.00 s, current LSN: 0/E94ED88
> > |2021-09-07 06:28:59.694 CDT startup[18865] LOG: redo in progress, elapsed time: 2.00 s, current LSN: 0/10808EE0
> > |2021-09-07 06:29:00.694 CDT startup[18865] LOG: redo in progress, elapsed time: 3.00 s, current LSN: 0/126B8C80
> >
> > (Also, the GUC value is in the range 0..INT_MAX, so multiplying and storing to
> > another int could overflow.)
> >
> > I think the convention is to for GUC global vars to be initialized with the
> > same default as in guc.c, so both should be 10000, like:
> >
> > +int log_startup_progress_interval = 10 * 1000 /* 10sec */
>
> Following is the discussion done wrt this point. Kindly refer and
> share your thoughts.
>
> > > > I suggest making the GUC GUC_UNIT_MS rather than GUC_UNIT_S, but
> > > > expressing the default in postgresl.conf.sample as 10s rather than
> > > > 10000ms. I tried values measured in milliseconds just for testing
> > > > purposes and didn't initially understand why it wasn't working. I
> > > > don't think there's any reason it can't work.
> > >
> > > As suggested, I have changed it to GUC_UNIT_MS and kept the default
> > > value to 10s. I would like to know the reason why it can't be
> > > GUC_UNIT_S as we are expressing the values in terms of seconds.
> >
> > I mean, it *could* be. There's just no advantage. Values in seconds
> > will work correctly either way. But values in milliseconds will only
> > work if it's GUC_UNIT_MS. It seems to me that it's better to make more
> > things work rather than fewer.
>
> Thanks & Regards,
> Nitin Jadhav
> On Tue, Sep 7, 2021 at 5:19 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> >
> > On Tue, Sep 07, 2021 at 03:07:15PM +0530, Nitin Jadhav wrote:
> > > > Looking over this version, I realized something I (or you) should have
> > > > noticed sooner: you've added the RegisterTimeout call to
> > > > InitPostgres(), but that's for things that are used by all backends,
> > > > and this is only used by the startup process. So it seems to me that
> > > > the right place is StartupProcessMain. That would have the further
> > > > advantage of allowing startup_progress_timeout_handler to be made
> > > > static. begin_startup_progress_phase() and
> > > > startup_progress_timeout_has_expired() are the actual API functions
> > > > though so they will need to remain extern.
> > > >
> > > > I agree with Robert that a standby server should not continuously show timing
> > > > messages for WAL replay.
> > >
> > > The earlier discussion wrt this point is as follows.
> >
> > I think you're confusing discussions.
> >
> > Robert was talking about initdb/bootstrap/single, and I separately and
> > independently asked about hot standbys. It seems like Robert and I agreed
> > about the desired behavior and there was no further discussion.
> >
> > > > Honestly, I don't see why we should have
> > > > a GUC for what's proposed here. A value too low would bloat the logs
> > > > with entries that are not that meaningful. A value too large would
> > > > just prevent access to some information wanted. Wouldn't it be better
> > > > to just pick up a value like 10s or 20s?
> >
> > I don't think bloating logs is a issue for values > 10sec.
> >
> > You agreed that it's important to choose the "right" value, but I think that
> > will vary between users. Some installations with large checkpoint_timeout
> > might anticipate taking 15+min to perform recovery, but others might even have
> > a strict requirement that recovery must not take more than (say) 10sec; someone
> > might want to use this to verify that, or to optimize the slow parts of
> > recovery, with an interval that someone else might not care about.
> >
> > > > + {"log_startup_progress_interval", PGC_SIGHUP, LOGGING_WHEN,
> > > > + gettext_noop("Sets the time interval between each progress update "
> > > > + "of the startup process."),
> > > > + gettext_noop("0 logs all messages. -1 turns this feature off."),
> > > > + GUC_UNIT_MS,
> > |+ 10, -1, INT_MAX,
> > > > The unit is incorrect here, as that would default to 10ms, contrary to
> > > > what the documentation says about 10s.
> > >
> > > Kindly refer the previous few discussions wrt this point.
> >
> > You copied and pasted unrelated emails, which isn't helpful.
> >
> > Michael is right. You updated some of the units based on Robert's suggestion
> > to use MS, but didn't update all of the corresponding parts of the patch.
> > guc.c says that the units are in MS, which means that unqualified values are
> > interpretted as such. But postgresql.conf.sample still says "seconds", and
> > guc.c says the default value is "10", and you still do:
> >
> > + interval_in_ms = log_startup_progress_interval * 1000;
> >
> > I checked that this currently does not interpret the value as ms:
> > |./tmp_install/usr/local/pgsql/bin/postgres -D src/test/regress/tmp_check/data/ -c log_startup_progress_interval=1
> > |2021-09-07 06:28:58.694 CDT startup[18865] LOG: redo in progress, elapsed time: 1.00 s, current LSN: 0/E94ED88
> > |2021-09-07 06:28:59.694 CDT startup[18865] LOG: redo in progress, elapsed time: 2.00 s, current LSN: 0/10808EE0
> > |2021-09-07 06:29:00.694 CDT startup[18865] LOG: redo in progress, elapsed time: 3.00 s, current LSN: 0/126B8C80
> >
> > (Also, the GUC value is in the range 0..INT_MAX, so multiplying and storing to
> > another int could overflow.)
> >
> > I think the convention is to for GUC global vars to be initialized with the
> > same default as in guc.c, so both should be 10000, like:
> >
> > +int log_startup_progress_interval = 10 * 1000 /* 10sec */
> >
> > --
> > Justin

Attachment Content-Type Size
v16-0001-startup-process-progress.patch application/octet-stream 14.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2021-09-22 14:50:01 Re: Refactoring postgres_fdw code to rollback remote transaction
Previous Message Jonathan S. Katz 2021-09-22 14:18:51 Re: PostgreSQL 14 press release draft