From: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> |
---|---|
To: | Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com> |
Cc: | "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com |
Subject: | Re: About to add WAL write/fsync statistics to pg_stat_wal view |
Date: | 2021-03-08 15:48:00 |
Message-ID: | c4e1ebb8-6cd7-ebbf-a770-c65ea54239ce@oss.nttdata.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 2021/03/08 19:42, Masahiro Ikeda wrote:
> On 2021-03-08 13:44, Fujii Masao wrote:
>> On 2021/03/05 19:54, Masahiro Ikeda wrote:
>>> On 2021-03-05 12:47, Fujii Masao wrote:
>>>> On 2021/03/05 8:38, Masahiro Ikeda wrote:
>>>>> On 2021-03-05 01:02, Fujii Masao wrote:
>>>>>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>>>>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>>>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I pgindented the patches.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>>>>>>>>>> <function>XLogFlush</function> request (see ...). This is also
>>>>>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ("which normally called" should be "which is normally called" or
>>>>>>>>>>>>> "which normally is called" if you want to keep true to the original)
>>>>>>>>>>>>> You missed the adding the space before an opening parenthesis here and
>>>>>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>>>>>
>>>>>>>>>>>>> is ether -> is either
>>>>>>>>>>>>> "This parameter is off by default as it will repeatedly query the
>>>>>>>>>>>>> operating system..."
>>>>>>>>>>>>> ", because" -> "as"
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, I fixed them.
>>>>>>>>>>>>
>>>>>>>>>>>>> wal_write_time and the sync items also need the note: "This is also
>>>>>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>>>>>
>>>>>>>>>>>> I skipped changing it since I separated the stats for the WAL receiver
>>>>>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>>>>>
>>>>>>>>>>>>> "The number of times it happened..." -> " (the tally of this event is
>>>>>>>>>>>>> reported in wal_buffers_full in....) This is undesirable because ..."
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, I fixed it.
>>>>>>>>>>>>
>>>>>>>>>>>>> I notice that the patch for WAL receiver doesn't require explicitly
>>>>>>>>>>>>> computing the sync statistics but does require computing the write
>>>>>>>>>>>>> statistics. This is because of the presence of issue_xlog_fsync but
>>>>>>>>>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
>>>>>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
>>>>>>>>>>>>> receiver path does not. It seems technically straight-forward to
>>>>>>>>>>>>> refactor here to avoid the almost-duplicated logic in the two places,
>>>>>>>>>>>>> though I suspect there may be a trade-off for not adding another
>>>>>>>>>>>>> function call to the stack given the importance of WAL processing
>>>>>>>>>>>>> (though that seems marginalized compared to the cost of actually
>>>>>>>>>>>>> writing the WAL). Or, as Fujii noted, go the other way and don't have
>>>>>>>>>>>>> any shared code between the two but instead implement the WAL receiver
>>>>>>>>>>>>> one to use pg_stat_wal_receiver instead. In either case, this
>>>>>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>>>>>
>>>>>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>>>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>>>>>
>>>>>>>>>>> Thanks for updating the patches!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
>>>>>>>>>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>>>>>>>>>>> What do you think?
>>>>>>>>>>>
>>>>>>>>>>> On second thought, this idea seems not good. Because those stats are
>>>>>>>>>>> collected between multiple walreceivers, but other values in
>>>>>>>>>>> pg_stat_wal_receiver is only related to the walreceiver process running
>>>>>>>>>>> at that moment. IOW, it seems strange that some values show dynamic
>>>>>>>>>>> stats and the others show collected stats, even though they are in
>>>>>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>>>>>
>>>>>>>>>> OK, I fixed it.
>>>>>>>>>> The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.
>>>>>>>>>
>>>>>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>>>>>
>>>>>>>>> + /* Check whether the WAL file was synced to disk right now */
>>>>>>>>> + if (enableFsync &&
>>>>>>>>> + (sync_method == SYNC_METHOD_FSYNC ||
>>>>>>>>> + sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>>>>>> + sync_method == SYNC_METHOD_FDATASYNC))
>>>>>>>>> + {
>>>>>>>>>
>>>>>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>>>>>> if enableFsync is off, sync_method is open_sync or open_data_sync,
>>>>>>>>> to simplify the code more?
>>>>>>>>
>>>>>>>> Thanks for the comments.
>>>>>>>> I added the above code in v12 patch.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> + /*
>>>>>>>>> + * Send WAL statistics only if WalWriterDelay has elapsed to minimize
>>>>>>>>> + * the overhead in WAL-writing.
>>>>>>>>> + */
>>>>>>>>> + if (rc & WL_TIMEOUT)
>>>>>>>>> + pgstat_send_wal();
>>>>>>>>>
>>>>>>>>> On second thought, this change means that it always takes wal_writer_delay
>>>>>>>>> before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
>>>>>>>>> For example, if wal_writer_delay is set to several seconds, some values in
>>>>>>>>> pg_stat_wal would be not up-to-date meaninglessly for those seconds.
>>>>>>>>> So I'm thinking to withdraw my previous comment and it's ok to send
>>>>>>>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>>>>>>>
>>>>>>>> Thanks, I didn't notice that.
>>>>>>>>
>>>>>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>>>>>> default value is 200msec and it may be set shorter time.
>>>>>>
>>>>>> Yeah, if wal_writer_delay is set to very small value, there is a risk
>>>>>> that the WAL stats are sent too frequently. I agree that's a problem.
>>>>>>
>>>>>>>>
>>>>>>>> Why don't to make another way to check the timestamp?
>>>>>>>>
>>>>>>>> + /*
>>>>>>>> + * Don't send a message unless it's been at least
>>>>>>>> PGSTAT_STAT_INTERVAL
>>>>>>>> + * msec since we last sent one
>>>>>>>> + */
>>>>>>>> + now = GetCurrentTimestamp();
>>>>>>>> + if (TimestampDifferenceExceeds(last_report, now,
>>>>>>>> PGSTAT_STAT_INTERVAL))
>>>>>>>> + {
>>>>>>>> + pgstat_send_wal();
>>>>>>>> + last_report = now;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>>
>>>>>>>> Although I worried that it's better to add the check code in pgstat_send_wal(),
>>>>>>
>>>>>> Agreed.
>>>>>>
>>>>>>>> I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
>>>>>>>> pgstat_send_wal() is invoked pg_report_stat() and it already checks the
>>>>>>>> PGSTAT_STAT_INTERVAL.
>>>>>>
>>>>>> I think that we can do that. What about the attached patch?
>>>>>
>>>>> Thanks, I thought it's better.
>>>>>
>>>>>
>>>>>>> I forgot to remove an unused variable.
>>>>>>> The attached v13 patch is fixed.
>>>>>>
>>>>>> Thanks for updating the patch!
>>>>>>
>>>>>> + w.wal_write,
>>>>>> + w.wal_write_time,
>>>>>> + w.wal_sync,
>>>>>> + w.wal_sync_time,
>>>>>>
>>>>>> It's more natural to put wal_write_time and wal_sync_time next to
>>>>>> each other? That is, what about the following order of columns?
>>>>>>
>>>>>> wal_write
>>>>>> wal_sync
>>>>>> wal_write_time
>>>>>> wal_sync_time
>>>>>
>>>>> Yes, I fixed it.
>>>>>
>>>>>> - case SYNC_METHOD_OPEN:
>>>>>> - case SYNC_METHOD_OPEN_DSYNC:
>>>>>> - /* write synced it already */
>>>>>> - break;
>>>>>>
>>>>>> IMO it's better to add Assert(false) here to ensure that we never reach
>>>>>> here, as follows. Thought?
>>>>>>
>>>>>> + case SYNC_METHOD_OPEN:
>>>>>> + case SYNC_METHOD_OPEN_DSYNC:
>>>>>> + /* not reachable */
>>>>>> + Assert(false);
>>>>>
>>>>> I agree.
>>>>>
>>>>>
>>>>>> Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
>>>>>> On the other hand, walwriter doesn't do that. Walwriter also should send
>>>>>> the stats even at its exit? Otherwise some stats can fail to be collected.
>>>>>> But ISTM that this issue existed from before, for example checkpointer
>>>>>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
>>>>>> this issue in this patch?
>>>>>
>>>>> Thanks, I thought it's better to do so.
>>>>> I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.
>>>>
>>>> Thanks!
>>>>
>>>> Seems you forgot to include the changes of expected/rules.out in 0001 patch,
>>>> and which caused the regression test to fail. Attached is the updated version
>>>> of the patch. I included expected/rules.out in it.
>>>
>>> Sorry.
>>>
>>>> + PgStat_Counter m_wal_write_time; /* time spend writing wal records in
>>>> + * micro seconds */
>>>> + PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in micro
>>>> + * seconds */
>>>>
>>>> IMO "spend" should be "spent". Also "micro seconds" should be "microseconds"
>>>> in sake of consistent with other comments in pgstat.h. I fixed them.
>>>
>>> Thanks.
>>>
>>>> Regarding pgstat_report_wal() and pgstat_send_wal(), I found one bug. Even
>>>> when pgstat_send_wal() returned without sending any message,
>>>> pgstat_report_wal() saved current pgWalUsage and that counter was used for
>>>> the subsequent calculation of WAL usage. This caused some counters not to
>>>> be sent to the collector. This is a bug that I added. I fixed this bug.
>>>
>>> Thanks.
>>>
>>>
>>>> + walStats.wal_write += msg->m_wal_write;
>>>> + walStats.wal_write_time += msg->m_wal_write_time;
>>>> + walStats.wal_sync += msg->m_wal_sync;
>>>> + walStats.wal_sync_time += msg->m_wal_sync_time;
>>>>
>>>> I changed the order of the above in pgstat.c so that wal_write_time and
>>>> wal_sync_time are placed in next to each other.
>>>
>>> I forgot to fix them, thanks.
>>>
>>>
>>>> The followings are the comments for the docs part. I've not updated this
>>>> in the patch yet because I'm not sure how to change them for now.
>>>> + Number of times WAL buffers were written out to disk via
>>>> + <function>XLogWrite</function>, which is invoked during an
>>>> + <function>XLogFlush</function> request (see <xref
>>>> linkend="wal-configuration"/>)
>>>> + </para></entry>
>>>>
>>>> XLogWrite() can be invoked during the functions other than XLogFlush().
>>>> For example, XLogBackgroundFlush(). So the above description might be
>>>> confusing?
>>>>
>>>> + Number of times WAL files were synced to disk via
>>>> + <function>issue_xlog_fsync</function>, which is invoked during an
>>>> + <function>XLogFlush</function> request (see <xref
>>>> linkend="wal-configuration"/>)
>>>>
>>>> Same as above.
>>>
>>> Yes, why don't you remove "XLogFlush" in the above comments
>>> because XLogWrite() description is covered in wal.sgml?
>>>
>>> But, now it's mentioned only for backend,
>>> I added the comments for the wal writer in the attached patch.
>>>
>>>
>>>> + while <xref linkend="guc-wal-sync-method"/> was set to one of the
>>>> + "sync at commit" options (i.e., <literal>fdatasync</literal>,
>>>> + <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
>>>>
>>>> Even open_sync and open_datasync do the sync at commit. No? I'm not sure
>>>> if "sync at commit" is right term to indicate fdatasync, fsync and
>>>> fsync_writethrough.
>>>
>>> Yes, why don't you change to the following comments?
>>>
>>> ```
>>> while <xref linkend="guc-wal-sync-method"/> was set to one of the
>>> options which specific fsync method is called (i.e., <literal>fdatasync</literal>,
>>> <literal>fsync</literal>, or <literal>fsync_writethrough</literal>)
>>> ```
>>>
>>>> + <literal>open_sync</literal>. Units are in milliseconds with
>>>> microsecond resolution.
>>>>
>>>> "with microsecond resolution" part is really necessary?
>>>
>>> I removed it because blk_read_time in pg_stat_database is the same above,
>>> but it doesn't mention it.
>>>
>>>
>>>> + transaction records are flushed to permanent storage.
>>>> + <function>XLogFlush</function> calls <function>XLogWrite</function> to write
>>>> + and <function>issue_xlog_fsync</function> to flush them, which are
>>>> counted as
>>>> + <literal>wal_write</literal> and <literal>wal_sync</literal> in
>>>> + <xref linkend="pg-stat-wal-view"/>. On systems with high log output,
>>>>
>>>> This description might cause users to misread that XLogFlush() calls
>>>> issue_xlog_fsync(). Since issue_xlog_fsync() is called by XLogWrite(),
>>>> ISTM that this description needs to be updated.
>>>
>>> I understood. I fixed to mention that XLogWrite()
>>> calls issue_xlog_fsync().
>>>
>>>
>>>> Each line in the above seems to end with a space character.
>>>> This space character should be removed.
>>>
>>> Sorry for that. I removed it.
>>
>> Thanks for updating the patch! I think it's getting good shape!
>> - pid | wait_event_type | wait_event
>> + pid | wait_event_type | wait_event
>>
>> This change is not necessary?
>
> No, sorry.
> I removed it by mistake when I remove trailing space characters.
>
>
>> - every <xref linkend="guc-wal-writer-delay"/> milliseconds.
>> + every <xref linkend="guc-wal-writer-delay"/> milliseconds, which calls
>> + <function>XLogWrite</function> to write and <function>XLogWrite</function>
>> + <function>issue_xlog_fsync</function> to flush them. They are counted as
>> + <literal>wal_write</literal> and <literal>wal_sync</literal> in
>> + <xref linkend="pg-stat-wal-view"/>.
>>
>> Isn't it better to avoid using the terms like XLogWrite or issue_xlog_fsync
>> before explaining what they are? They are explained later. At least for me
>> I'm ok without this change.
>
> OK. I removed them and add a new paragraph.
>
>
>> - to write (move to kernel cache) a few filled <acronym>WAL</acronym>
>> - buffers. This is undesirable because <function>XLogInsertRecord</function>
>> + to call <function>XLogWrite</function> to write (move to kernel cache) a
>> + few filled <acronym>WAL</acronym> buffers (the tally of this event
>> is reported in
>> + <literal>wal_buffers_full</literal> in <xref linkend="pg-stat-wal-view"/>).
>> + This is undesirable because <function>XLogInsertRecord</function>
>>
>> This paragraph explains the relationshp between WAL writes and WAL
>> buffers. I don't think it's good to add different context to this
>> paragraph. Instead, what about adding new paragraph like the follwing?
>>
>> ----------------------------------
>> When track_wal_io_timing is enabled, the total amounts of time
>> XLogWrite writes and issue_xlog_fsync syncs WAL data to disk are
>> counted as wal_write_time and wal_sync_time in pg_stat_wal view,
>> respectively. XLogWrite is normally called by XLogInsertRecord (when
>> there is no space for the new record in WAL buffers), XLogFlush and
>> the WAL writer, to write WAL buffers to disk and call
>> issue_xlog_fsync. If wal_sync_method is either open_datasync or
>> open_sync, a write operation in XLogWrite guarantees to sync written
>> WAL data to disk and issue_xlog_fsync does nothing. If wal_sync_method
>> is either fdatasync, fsync, or fsync_writethrough, the write operation
>> moves WAL buffer to kernel cache and issue_xlog_fsync syncs WAL files
>> to disk. Regardless of the setting of track_wal_io_timing, the numbers
>> of times XLogWrite writes and issue_xlog_fsync syncs WAL data to disk
>> are also counted as wal_write and wal_sync in pg_stat_wal,
>> respectively.
>> ----------------------------------
>
> Thanks, I agree it's better.
>
>
>> + <function>issue_xlog_fsync</function> (see <xref
>> linkend="wal-configuration"/>)
>>
>> "request" should be place just before "(see"?
>
> Yes, thanks.
>
>
>
>> + Number of times WAL files were synced to disk via
>> + <function>issue_xlog_fsync</function> (see <xref
>> linkend="wal-configuration"/>)
>> + while <xref linkend="guc-wal-sync-method"/> was set to one of the
>> + options which specific fsync method is called (i.e.,
>> <literal>fdatasync</literal>,
>> + <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
>>
>> Isn't it better to mention the case of fsync=off? What about the following?
>>
>> ----------------------------------
>> Number of times WAL files were synced to disk via issue_xlog_fsync
>> (see ...). This is zero when fsync is off or wal_sync_method is either
>> open_datasync or open_sync.
>> ----------------------------------
>
> Yes.
>
>
>> + Total amount of time spent writing WAL buffers were written
>> out to disk via
>>
>> "were written out" is not necessary?
>
> Yes, removed it.
>
>> + Total amount of time spent syncing WAL files to disk via
>> + <function>issue_xlog_fsync</function> request (see <xref
>> linkend="wal-configuration"/>)
>> + while <xref linkend="guc-wal-sync-method"/> was set to one of the
>> + options which specific fsync method is called (i.e.,
>> <literal>fdatasync</literal>,
>> + <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
>> + Units are in milliseconds.
>> + This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
>>
>> Isn't it better to explain the case where this counter is zero a bit
>> more clearly as follows?
>>
>> ---------------------
>> This is zero when track_wal_io_timing is disabled, fsync is off, or
>> wal_sync_method is either open_datasync or open_sync.
>> ---------------------
>
> Yes, thanks.
Thanks for updating the patch! I applied cosmetic changes to that.
Patch attached. Barring any objection, I will commit this version.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachment | Content-Type | Size |
---|---|---|
v16-0001-Add-statistics-related-to-write-sync-wal-records_fujii.patch | text/plain | 21.7 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Fujii Masao | 2021-03-08 15:54:18 | Re: [PATCH] pgbench: improve \sleep meta command |
Previous Message | Mark Dilger | 2021-03-08 15:39:44 | Re: proposal: psql –help reflecting service or URI usage |