From: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> |
---|---|
To: | Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com> |
Cc: | "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com |
Subject: | Re: About to add WAL write/fsync statistics to pg_stat_wal view |
Date: | 2021-03-12 03:39:22 |
Message-ID: | a965d3ed-47d3-7d30-c15f-fc0f98810e4d@oss.nttdata.com |
Lists: | pgsql-hackers |
On 2021/03/11 21:29, Masahiro Ikeda wrote:
> On 2021-03-11 11:52, Fujii Masao wrote:
>> On 2021/03/11 9:38, Masahiro Ikeda wrote:
>>> On 2021-03-10 17:08, Fujii Masao wrote:
>>>> On 2021/03/10 14:11, Masahiro Ikeda wrote:
>>>>> On 2021-03-09 17:51, Fujii Masao wrote:
>>>>>> On 2021/03/05 8:38, Masahiro Ikeda wrote:
>>>>>>> On 2021-03-05 01:02, Fujii Masao wrote:
>>>>>>>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>>>>>>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>>>>>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>>>>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>>>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I pgindented the patches.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>>>>>>>>>>>> <function>XLogFlush</function> request (see ...). This is also
>>>>>>>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ("which normally called" should be "which is normally called" or
>>>>>>>>>>>>>>> "which normally is called" if you want to keep true to the original)
>>>>>>>>>>>>>>> You missed adding the space before an opening parenthesis here and
>>>>>>>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> is ether -> is either
>>>>>>>>>>>>>>> "This parameter is off by default as it will repeatedly query the
>>>>>>>>>>>>>>> operating system..."
>>>>>>>>>>>>>>> ", because" -> "as"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, I fixed them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> wal_write_time and the sync items also need the note: "This is also
>>>>>>>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I skipped changing it since I separated the stats for the WAL receiver
>>>>>>>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "The number of times it happened..." -> " (the tally of this event is
>>>>>>>>>>>>>>> reported in wal_buffers_full in....) This is undesirable because ..."
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, I fixed it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I notice that the patch for WAL receiver doesn't require explicitly
>>>>>>>>>>>>>>> computing the sync statistics but does require computing the write
>>>>>>>>>>>>>>> statistics. This is because of the presence of issue_xlog_fsync but
>>>>>>>>>>>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
>>>>>>>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
>>>>>>>>>>>>>>> receiver path does not. It seems technically straight-forward to
>>>>>>>>>>>>>>> refactor here to avoid the almost-duplicated logic in the two places,
>>>>>>>>>>>>>>> though I suspect there may be a trade-off for not adding another
>>>>>>>>>>>>>>> function call to the stack given the importance of WAL processing
>>>>>>>>>>>>>>> (though that seems marginalized compared to the cost of actually
>>>>>>>>>>>>>>> writing the WAL). Or, as Fujii noted, go the other way and don't have
>>>>>>>>>>>>>>> any shared code between the two but instead implement the WAL receiver
>>>>>>>>>>>>>>> one to use pg_stat_wal_receiver instead. In either case, this
>>>>>>>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>>>>>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for updating the patches!
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
>>>>>>>>>>>>>> the stats for the WAL receiver are counted in pg_stat_wal_receiver.
>>>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On second thought, this idea seems not good, because those stats are
>>>>>>>>>>>>> collected across multiple walreceivers, while the other values in
>>>>>>>>>>>>> pg_stat_wal_receiver are only related to the walreceiver process running
>>>>>>>>>>>>> at that moment. IOW, it seems strange that some values show dynamic
>>>>>>>>>>>>> stats and the others show collected stats, even though they are in
>>>>>>>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>>>>>>>
>>>>>>>>>>>> OK, I fixed it.
>>>>>>>>>>>> The stats collected in the WAL receiver are exposed in the pg_stat_wal view in the v11 patch.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>>>>>>>
>>>>>>>>>>> + /* Check whether the WAL file was synced to disk right now */
>>>>>>>>>>> + if (enableFsync &&
>>>>>>>>>>> + (sync_method == SYNC_METHOD_FSYNC ||
>>>>>>>>>>> + sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>>>>>>>> + sync_method == SYNC_METHOD_FDATASYNC))
>>>>>>>>>>> + {
>>>>>>>>>>>
>>>>>>>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>>>>>>>> if enableFsync is off or sync_method is open_sync or open_datasync,
>>>>>>>>>>> to simplify the code more?
>>>>>>>>>>
>>>>>>>>>> Thanks for the comments.
>>>>>>>>>> I added the above code in v12 patch.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> + /*
>>>>>>>>>>> + * Send WAL statistics only if WalWriterDelay has elapsed to minimize
>>>>>>>>>>> + * the overhead in WAL-writing.
>>>>>>>>>>> + */
>>>>>>>>>>> + if (rc & WL_TIMEOUT)
>>>>>>>>>>> + pgstat_send_wal();
>>>>>>>>>>>
>>>>>>>>>>> On second thought, this change means that it always takes wal_writer_delay
>>>>>>>>>>> before walwriter's WAL stats are sent after XLogBackgroundFlush() is called.
>>>>>>>>>>> For example, if wal_writer_delay is set to several seconds, some values in
>>>>>>>>>>> pg_stat_wal would be needlessly out of date for those seconds.
>>>>>>>>>>> So I'm thinking of withdrawing my previous comment; it's ok to send
>>>>>>>>>>> the stats every time XLogBackgroundFlush() is called. Thought?
>>>>>>>>>>
>>>>>>>>>> Thanks, I didn't notice that.
>>>>>>>>>>
>>>>>>>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>>>>>>>> default value is 200msec and it may be set to a shorter time.
>>>>>>>>
>>>>>>>> Yeah, if wal_writer_delay is set to a very small value, there is a risk
>>>>>>>> that the WAL stats are sent too frequently. I agree that's a problem.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Why don't we check the timestamp in another way?
>>>>>>>>>>
>>>>>>>>>> + /*
>>>>>>>>>> + * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
>>>>>>>>>> + * msec since we last sent one
>>>>>>>>>> + */
>>>>>>>>>> + now = GetCurrentTimestamp();
>>>>>>>>>> + if (TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
>>>>>>>>>> + {
>>>>>>>>>> + pgstat_send_wal();
>>>>>>>>>> + last_report = now;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>>
>>>>>>>>>> Although I wondered whether it would be better to add the check code in pgstat_send_wal(),
>>>>>>>>
>>>>>>>> Agreed.
>>>>>>>>
>>>>>>>>>> I didn't do so, to avoid checking PGSTAT_STAT_INTERVAL twice:
>>>>>>>>>> pgstat_send_wal() is invoked from pgstat_report_stat(), which already
>>>>>>>>>> checks PGSTAT_STAT_INTERVAL.
>>>>>>>>
>>>>>>>> I think that we can do that. What about the attached patch?
>>>>>>>
>>>>>>> Thanks, I think that's better.
>>>>>>>
>>>>>>>
>>>>>>>>> I forgot to remove an unused variable.
>>>>>>>>> The attached v13 patch is fixed.
>>>>>>>>
>>>>>>>> Thanks for updating the patch!
>>>>>>>>
>>>>>>>> + w.wal_write,
>>>>>>>> + w.wal_write_time,
>>>>>>>> + w.wal_sync,
>>>>>>>> + w.wal_sync_time,
>>>>>>>>
>>>>>>>> It's more natural to put wal_write_time and wal_sync_time next to
>>>>>>>> each other? That is, what about the following order of columns?
>>>>>>>>
>>>>>>>> wal_write
>>>>>>>> wal_sync
>>>>>>>> wal_write_time
>>>>>>>> wal_sync_time
>>>>>>>
>>>>>>> Yes, I fixed it.
>>>>>>>
>>>>>>>> - case SYNC_METHOD_OPEN:
>>>>>>>> - case SYNC_METHOD_OPEN_DSYNC:
>>>>>>>> - /* write synced it already */
>>>>>>>> - break;
>>>>>>>>
>>>>>>>> IMO it's better to add Assert(false) here to ensure that we never reach
>>>>>>>> here, as follows. Thought?
>>>>>>>>
>>>>>>>> + case SYNC_METHOD_OPEN:
>>>>>>>> + case SYNC_METHOD_OPEN_DSYNC:
>>>>>>>> + /* not reachable */
>>>>>>>> + Assert(false);
>>>>>>>
>>>>>>> I agree.
>>>>>>>
>>>>>>>
>>>>>>>> Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
>>>>>>>> On the other hand, walwriter doesn't do that. Should walwriter also send
>>>>>>>> the stats at its exit? Otherwise some stats can fail to be collected.
>>>>>>>> But ISTM that this issue has existed before; for example, checkpointer
>>>>>>>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
>>>>>>>> this issue in this patch?
>>>>>>>
>>>>>>> Thanks, I thought it would be better to do so.
>>>>>>> I added shutdown hooks for the walwriter and the checkpointer in the v14-0003 patch.
>>>>>>
>>>>>> Thanks for 0003 patch!
>>>>>>
>>>>>> Isn't it overkill to send the stats in the walwriter-exit-callback? IMO we can
>>>>>> just send the stats only when ShutdownRequestPending is true in the walwriter
>>>>>> main loop (maybe just before calling HandleMainLoopInterrupts()).
>>>>>> If we do this, we cannot send the stats when walwriter throws a FATAL error.
>>>>>> But that's ok because a FATAL error in walwriter causes the server to crash.
>>>>>> Thought?
>>>>>
>>>>> Thanks for your comments!
>>>>> Yes, I agree.
>>>>>
>>>>>
>>>>>> Also ISTM that we don't need to use the callback for that purpose in
>>>>>> checkpointer because of the same reason. That is, we can send the stats
>>>>>> just after calling ShutdownXLOG(0, 0) in HandleCheckpointerInterrupts().
>>>>>> Thought?
>>>>>
>>>>> Yes, I think so too.
>>>>>
>>>>> Since ShutdownXLOG() may create a restartpoint or a checkpoint,
>>>>> it might generate WAL records.
>>>>>
>>>>>
>>>>>> I'm now not sure how useful these changes are. As far as I read pgstat.c,
>>>>>> when shutdown is requested, the stats collector seems to exit even when
>>>>>> there are outstanding stats messages. So if checkpointer and walwriter send
>>>>>> the stats in their last cycles, those stats might not be collected.
>>>>>>
>>>>>> On the other hand, I think that sending the stats in the last cycles would
>>>>>> improve the situation a bit compared to now. So I'm inclined to apply those changes...
>>>>>
>>>>> I didn't notice that. I agree this is an important aspect.
>>>>> I understand there is a case where the stats collector exits before the checkpointer
>>>>> or the walwriter does, so some stats might not be collected.
>>>>
>>>> IIUC the stats collector basically exits after checkpointer and walwriter exit.
>>>> But there seems no guarantee that the stats collector processes
>>>> all the messages that other processes have sent during the shutdown of
>>>> the server.
>>>
>>> Thanks, I understood the above postmaster behaviors.
>>>
>>> PMState manages the status, and after the checkpointer has exited, the postmaster sends
>>> a SIGQUIT signal to the stats collector if the shutdown mode is smart or fast.
>>> (IIUC, although the postmaster kills the walsender, the archiver and
>>> the stats collector at the same time, it's ok because the walsender
>>> and the archiver don't send stats to the stats collector now.)
>>>
>>> But there might be a corner case where stats sent by background workers like
>>> the checkpointer just before they exit are lost (although this is not implemented yet).
>>>
>>> For example,
>>>
>>> 1. The checkpointer sends the stats before it exits.
>>> 2. The stats collector receives the signal and breaks out of its loop before processing
>>> the stats message from the checkpointer. In this case, the message from step 1 is lost.
>>> 3. The stats collector writes the stats to the stats files and exits.
>>>
>>> Why don't we recheck that no messages remain just before step 2?
>>> (v17-0004-guarantee-to-collect-last-stats-messages.patch)
>>
>> Yes, I was thinking the same. This is the straightforward fix for this issue.
>> The stats collector should process all the outstanding messages when
>> normal shutdown is requested, as the patch does. On the other hand,
>> if immediate shutdown is requested or emergency bailout (by postmaster death)
>> is requested, maybe the stats collector should skip that processing
>> and exit immediately.
>>
>> But if we implement that, we would need to teach the stats collector
>> the shutdown type (i.e., normal shutdown or immediate one). Because
>> currently SIGQUIT is sent to the collector whichever shutdown is requested,
>> and so the collector cannot distinguish the shutdown type. I'm afraid that
>> change is a bit overkill for now.
>>
>> BTW, I found that the collector calls pgstat_write_statsfiles() even in the
>> emergency bailout case, before exiting. It's not necessary to save
>> the stats to the file in that case because subsequent server startup does
>> crash recovery and clears that stats file. So it's better to make
>> the collector exit immediately without calling pgstat_write_statsfiles()
>> in the emergency bailout case? Probably this should be discussed in another
>> thread because it's a different topic from the feature we're discussing here,
>> though.
>
> IIUC, only the stats collector has its own handler for SIGQUIT, while the
> other background processes have a common handler for it and just call _exit(2).
> I thought we should guarantee that when TerminateChildren(SIGTERM) is invoked, the stats
> collector does not shut down before the other background processes have shut down.
>
> I will start another thread to discuss whether the stats collector should know the shutdown type or not.
> If it should, it's better to make the stats collector exit as soon as possible if the shutdown type
> is immediate, and to avoid losing the remaining stats if it's normal.
>
>
>
>>> I measured the timing of the above on my Linux laptop using v17-measure-timing.patch.
>>> I don't have a strong opinion on handling this case, since this result shows that receiving and processing
>>> the messages takes very little time (less than 1ms), although the stats collector receives the shutdown
>>> signal 5msec (099 -> 104) after the checkpointer process exits.
>>
>> Agreed.
>>
>>>
>>> ```
>>> 1615421204.556 [checkpointer] DEBUG: received shutdown request signal
>>> 1615421208.099 [checkpointer] DEBUG: proc_exit(-1): 0 callbacks to make # exit and send the messages
>>> 1615421208.099 [stats collector] DEBUG: process BGWRITER stats message # receive and process the messages
>>> 1615421208.099 [stats collector] DEBUG: process WAL stats message
>>> 1615421208.104 [postmaster] DEBUG: reaping dead processes
>>> 1615421208.104 [stats collector] DEBUG: received shutdown request signal # receive shutdown request from the postmaster
>>> ```
>>>
>>>>>> Of course, there is another direction; we can improve the stats collector so
>>>>>> that it guarantees to collect all the sent stats messages. But I'm afraid
>>>>>> this change might be big.
>>>>>
>>>>> For example, manage the background process status in shared memory so that
>>>>> the stats collector keeps collecting the stats until the other background processes exit?
>>>>>
>>>>> In my understanding, the statistics don't require high accuracy,
>>>>> so it's ok to ignore them if the impact is not big.
>>>>>
>>>>> If we guarantee high accuracy, other background processes like the autovacuum launcher
>>>>> must also send WAL stats, because they access the system catalogs and might generate
>>>>> WAL records due to HOT updates, even though the possibility is low.
>>>>>
>>>>> I thought the impact is small because the period in which uncollected stats are generated is
>>>>> short compared to the time since startup. So, it's ok to ignore the remaining stats
>>>>> when the process exits.
>>>>
>>>> I agree that it's not worth changing lots of code to collect such stats.
>>>> But if we can implement that very simply, isn't it worth doing
>>>> compared to the current situation, because we may be able to collect more
>>>> accurate stats?
>>>
>>> Yes, I agree.
>>> I attached the patch to send the stats before the wal writer and the checkpointer exit.
>>> (v17-0001-send-stats-for-walwriter-when-shutdown.patch, v17-0002-send-stats-for-checkpointer-when-shutdown.patch)
>>
>> Thanks for making those patches! Firstly I'm reading 0001 and 0002 patches.
>
> Thanks for your comments and for making patches.
>
>
>> Here is the review comments for 0001 patch.
>>
>> +/* Prototypes for private functions */
>> +static void HandleWalWriterInterrupts(void);
>>
>> HandleWalWriterInterrupts() and HandleMainLoopInterrupts() are almost the same.
>> So I don't think that we need to introduce HandleWalWriterInterrupts(). Instead,
>> we can just call pgstat_send_wal(true) before HandleMainLoopInterrupts()
>> if ShutdownRequestPending is true in the main loop. Attached is the patch
>> I implemented that way. Thought?
>
> I thought there is a corner case where the stats can't be sent, like:
>
> ```
> // First, ShutdownRequestPending = false
>
> if (ShutdownRequestPending) // don't send the stats
> pgstat_send_wal(true);
>
> // receive signal and ShutdownRequestPending became true
>
> HandleMainLoopInterrupts(); // proc exit without sending the stats
>
> ```
>
> Is it ok because it almost never occurs?
>
>
>> Here is the review comments for 0002 patch.
>>
>> +static void pgstat_send_checkpointer(void);
>>
>> I'm inclined to avoid adding the function with the prefix "pgstat_" outside
>> pgstat.c. Instead, I'm ok to just call both pgstat_send_bgwriter() and
>> pgstat_report_wal() directly after ShutdownXLOG(). Thought? Patch attached.
>
> Thanks. I agree.
>
>
>>>>> BTW, I found that BgWriterStats.m_timed_checkpoints is not counted in ShutdownXLOG(),
>>>>> and we need to count it if we want to collect stats before the process exits.
>>>>
>>>> Maybe m_requested_checkpoints should be incremented in that case?
>>>
>>> I thought this should be incremented
>>> because it invokes the methods with CHECKPOINT_IS_SHUTDOWN.
>>
>> Yes.
>
> OK, thanks.
>
>
>>>
>>> ```ShutdownXLOG()
>>> CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
>>> CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
>>> ```
>>>
>>> I fixed in v17-0002-send-stats-for-checkpointer-when-shutdown.patch.
>>>
>>>
>>> In addition, I rebased the patch for WAL receiver.
>>> (v17-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>
>> Thanks! Will review this later.
>
> Thanks a lot!
I read through the 0003 patch. Here are some comments for that.
With the patch, walreceiver's stats are counted as wal_write, wal_sync, wal_write_time and wal_sync_time in pg_stat_wal. But shouldn't they be counted in separate columns, because WAL I/O differs between the walreceiver and other processes like a backend? For example, when open_sync or open_datasync is chosen as wal_sync_method, those other processes use the O_DIRECT flag to open WAL files, but the walreceiver does not. Also, those other processes write WAL data in block units, but the walreceiver does not. So I'm concerned that mixing different WAL I/O stats in the same columns would confuse users. Thought? I'd like to hear more opinions about how to expose the walreceiver's stats to users.
+int
+XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset)
This common function writes WAL data and measures the I/O timing. IMO we can refactor the code further by making this function also handle the case where pg_pwrite() reports an error. In other words, I think that the function should do what the do-while loop block in XLogWrite() does. Thought?
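To illustrate the direction I have in mind (just a rough sketch of the suggestion, not the actual patch; the error message wording and includes are simplified), XLogWriteFile() could own the whole retry loop:
```
/*
 * Rough sketch (not the patch itself): XLogWriteFile() takes over the
 * do-while loop from XLogWrite(), retrying partial writes and EINTR and
 * reporting any error itself, so that XLogWrite() and the walreceiver
 * don't have to duplicate that logic.  Assumes the usual xlog.c includes
 * (postgres.h, pgstat.h, etc.); it could then return void instead of int.
 */
void
XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset)
{
    const char *from = buf;
    size_t      nleft = nbyte;

    do
    {
        ssize_t     written;

        errno = 0;
        pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
        written = pg_pwrite(fd, from, nleft, offset);
        pgstat_report_wait_end();

        if (written <= 0)
        {
            /* retry if we were interrupted by a signal, otherwise bail out */
            if (errno == EINTR)
                continue;
            ereport(PANIC,
                    (errcode_for_file_access(),
                     errmsg("could not write to WAL file at offset %llu: %m",
                            (unsigned long long) offset)));
        }

        /* advance past the bytes that were actually written */
        nleft -= written;
        from += written;
        offset += written;
    } while (nleft > 0);
}
```
XLogWrite() and the walreceiver could then just call this in place of their own loops.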
BTW, currently XLogWrite() increments the I/O timing even when pg_pwrite() reports an error. But this is useless. Probably the I/O timing should be incremented only after the return code of pg_pwrite() has been checked, instead?
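Something along these lines inside the existing do-while loop (again only a sketch; the counter name follows the patch's WalStats message and track_wal_io_timing is the GUC the patch adds):
```
/* before the write, as in the current code */
if (track_wal_io_timing)
    INSTR_TIME_SET_CURRENT(start);

errno = 0;
pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
written = pg_pwrite(openLogFile, from, nleft, startoffset);
pgstat_report_wait_end();

/* check the result first ... */
if (written <= 0)
{
    /* the existing EINTR / PANIC handling stays here */
}

/* ... and count the I/O time only once the write is known to have succeeded */
if (track_wal_io_timing)
{
    instr_time  duration;

    INSTR_TIME_SET_CURRENT(duration);
    INSTR_TIME_SUBTRACT(duration, start);
    WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
}
```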
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION