Re: About to add WAL write/fsync statistics to pg_stat_wal view

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject: Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date: 2021-03-08 15:48:00
Message-ID: c4e1ebb8-6cd7-ebbf-a770-c65ea54239ce@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2021/03/08 19:42, Masahiro Ikeda wrote:
> On 2021-03-08 13:44, Fujii Masao wrote:
>> On 2021/03/05 19:54, Masahiro Ikeda wrote:
>>> On 2021-03-05 12:47, Fujii Masao wrote:
>>>> On 2021/03/05 8:38, Masahiro Ikeda wrote:
>>>>> On 2021-03-05 01:02, Fujii Masao wrote:
>>>>>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>>>>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>>>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I pgindented the patches.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>>>>>>>>>> <function>XLogFlush</function> request (see ...).  This is also
>>>>>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ("which normally called" should be "which is normally called" or
>>>>>>>>>>>>> "which normally is called" if you want to keep true to the original)
>>>>>>>>>>>>> You missed the adding the space before an opening parenthesis here and
>>>>>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>>>>>
>>>>>>>>>>>>> is ether -> is either
>>>>>>>>>>>>> "This parameter is off by default as it will repeatedly query the
>>>>>>>>>>>>> operating system..."
>>>>>>>>>>>>> ", because" -> "as"
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, I fixed them.
>>>>>>>>>>>>
>>>>>>>>>>>>> wal_write_time and the sync items also need the note: "This is also
>>>>>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>>>>>
>>>>>>>>>>>> I skipped changing it since I separated the stats for the WAL receiver
>>>>>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>>>>>
>>>>>>>>>>>>> "The number of times it happened..." -> " (the tally of this event is
>>>>>>>>>>>>> reported in wal_buffers_full in....) This is undesirable because ..."
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, I fixed it.
>>>>>>>>>>>>
>>>>>>>>>>>>> I notice that the patch for WAL receiver doesn't require explicitly
>>>>>>>>>>>>> computing the sync statistics but does require computing the write
>>>>>>>>>>>>> statistics.  This is because of the presence of issue_xlog_fsync but
>>>>>>>>>>>>> absence of an equivalent pg_xlog_pwrite.  Additionally, I observe that
>>>>>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
>>>>>>>>>>>>> receiver path does not.  It seems technically straight-forward to
>>>>>>>>>>>>> refactor here to avoid the almost-duplicated logic in the two places,
>>>>>>>>>>>>> though I suspect there may be a trade-off for not adding another
>>>>>>>>>>>>> function call to the stack given the importance of WAL processing
>>>>>>>>>>>>> (though that seems marginalized compared to the cost of actually
>>>>>>>>>>>>> writing the WAL).  Or, as Fujii noted, go the other way and don't have
>>>>>>>>>>>>> any shared code between the two but instead implement the WAL receiver
>>>>>>>>>>>>> one to use pg_stat_wal_receiver instead.  In either case, this
>>>>>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>>>>>
>>>>>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>>>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>>>>>
>>>>>>>>>>> Thanks for updating the patches!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
>>>>>>>>>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>>>>>>>>>>> What do you think?
>>>>>>>>>>>
>>>>>>>>>>> On second thought, this idea seems not good. Because those stats are
>>>>>>>>>>> collected between multiple walreceivers, but other values in
>>>>>>>>>>> pg_stat_wal_receiver is only related to the walreceiver process running
>>>>>>>>>>> at that moment. IOW, it seems strange that some values show dynamic
>>>>>>>>>>> stats and the others show collected stats, even though they are in
>>>>>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>>>>>
>>>>>>>>>> OK, I fixed it.
>>>>>>>>>> The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.
>>>>>>>>>
>>>>>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>>>>>
>>>>>>>>> +    /* Check whether the WAL file was synced to disk right now */
>>>>>>>>> +    if (enableFsync &&
>>>>>>>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>>>>>>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>>>>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>>>>>>>> +    {
>>>>>>>>>
>>>>>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>>>>>> if enableFsync is off, sync_method is open_sync or open_data_sync,
>>>>>>>>> to simplify the code more?
>>>>>>>>
>>>>>>>> Thanks for the comments.
>>>>>>>> I added the above code in v12 patch.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> +        /*
>>>>>>>>> +         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
>>>>>>>>> +         * the overhead in WAL-writing.
>>>>>>>>> +         */
>>>>>>>>> +        if (rc & WL_TIMEOUT)
>>>>>>>>> +            pgstat_send_wal();
>>>>>>>>>
>>>>>>>>> On second thought, this change means that it always takes wal_writer_delay
>>>>>>>>> before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
>>>>>>>>> For example, if wal_writer_delay is set to several seconds, some values in
>>>>>>>>> pg_stat_wal would be not up-to-date meaninglessly for those seconds.
>>>>>>>>> So I'm thinking to withdraw my previous comment and it's ok to send
>>>>>>>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>>>>>>>
>>>>>>>> Thanks, I didn't notice that.
>>>>>>>>
>>>>>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>>>>>> default value is 200msec and it may be set shorter time.
>>>>>>
>>>>>> Yeah, if wal_writer_delay is set to very small value, there is a risk
>>>>>> that the WAL stats are sent too frequently. I agree that's a problem.
>>>>>>
>>>>>>>>
>>>>>>>> Why don't to make another way to check the timestamp?
>>>>>>>>
>>>>>>>> +               /*
>>>>>>>> +                * Don't send a message unless it's been at least
>>>>>>>> PGSTAT_STAT_INTERVAL
>>>>>>>> +                * msec since we last sent one
>>>>>>>> +                */
>>>>>>>> +               now = GetCurrentTimestamp();
>>>>>>>> +               if (TimestampDifferenceExceeds(last_report, now,
>>>>>>>> PGSTAT_STAT_INTERVAL))
>>>>>>>> +               {
>>>>>>>> +                       pgstat_send_wal();
>>>>>>>> +                       last_report = now;
>>>>>>>> +               }
>>>>>>>> +
>>>>>>>>
>>>>>>>> Although I worried that it's better to add the check code in pgstat_send_wal(),
>>>>>>
>>>>>> Agreed.
>>>>>>
>>>>>>>> I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
>>>>>>>> pgstat_send_wal() is invoked pg_report_stat() and it already checks the
>>>>>>>> PGSTAT_STAT_INTERVAL.
>>>>>>
>>>>>> I think that we can do that. What about the attached patch?
>>>>>
>>>>> Thanks, I thought it's better.
>>>>>
>>>>>
>>>>>>> I forgot to remove an unused variable.
>>>>>>> The attached v13 patch is fixed.
>>>>>>
>>>>>> Thanks for updating the patch!
>>>>>>
>>>>>> +        w.wal_write,
>>>>>> +        w.wal_write_time,
>>>>>> +        w.wal_sync,
>>>>>> +        w.wal_sync_time,
>>>>>>
>>>>>> It's more natural to put wal_write_time and wal_sync_time next to
>>>>>> each other? That is, what about the following order of columns?
>>>>>>
>>>>>> wal_write
>>>>>> wal_sync
>>>>>> wal_write_time
>>>>>> wal_sync_time
>>>>>
>>>>> Yes, I fixed it.
>>>>>
>>>>>> -        case SYNC_METHOD_OPEN:
>>>>>> -        case SYNC_METHOD_OPEN_DSYNC:
>>>>>> -            /* write synced it already */
>>>>>> -            break;
>>>>>>
>>>>>> IMO it's better to add Assert(false) here to ensure that we never reach
>>>>>> here, as follows. Thought?
>>>>>>
>>>>>> +        case SYNC_METHOD_OPEN:
>>>>>> +        case SYNC_METHOD_OPEN_DSYNC:
>>>>>> +            /* not reachable */
>>>>>> +            Assert(false);
>>>>>
>>>>> I agree.
>>>>>
>>>>>
>>>>>> Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
>>>>>> On the other hand, walwriter doesn't do that. Walwriter also should send
>>>>>> the stats even at its exit? Otherwise some stats can fail to be collected.
>>>>>> But ISTM that this issue existed from before, for example checkpointer
>>>>>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
>>>>>> this issue in this patch?
>>>>>
>>>>> Thanks, I thought it's better to do so.
>>>>> I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.
>>>>
>>>> Thanks!
>>>>
>>>> Seems you forgot to include the changes of expected/rules.out in 0001 patch,
>>>> and which caused the regression test to fail. Attached is the updated version
>>>> of the patch. I included expected/rules.out in it.
>>>
>>> Sorry.
>>>
>>>> +    PgStat_Counter m_wal_write_time;    /* time spend writing wal records in
>>>> +                                         * micro seconds */
>>>> +    PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in micro
>>>> +                                     * seconds */
>>>>
>>>> IMO "spend" should be "spent". Also "micro seconds" should be "microseconds"
>>>> in sake of consistent with other comments in pgstat.h. I fixed them.
>>>
>>> Thanks.
>>>
>>>> Regarding pgstat_report_wal() and pgstat_send_wal(), I found one bug. Even
>>>> when pgstat_send_wal() returned without sending any message,
>>>> pgstat_report_wal() saved current pgWalUsage and that counter was used for
>>>> the subsequent calculation of WAL usage. This caused some counters not to
>>>> be sent to the collector. This is a bug that I added. I fixed this bug.
>>>
>>> Thanks.
>>>
>>>
>>>> +    walStats.wal_write += msg->m_wal_write;
>>>> +    walStats.wal_write_time += msg->m_wal_write_time;
>>>> +    walStats.wal_sync += msg->m_wal_sync;
>>>> +    walStats.wal_sync_time += msg->m_wal_sync_time;
>>>>
>>>> I changed the order of the above in pgstat.c so that wal_write_time and
>>>> wal_sync_time are placed in next to each other.
>>>
>>> I forgot to fix them, thanks.
>>>
>>>
>>>> The followings are the comments for the docs part. I've not updated this
>>>> in the patch yet because I'm not sure how to change them for now.
>>>> +       Number of times WAL buffers were written out to disk via
>>>> +       <function>XLogWrite</function>, which is invoked during an
>>>> +       <function>XLogFlush</function> request (see <xref
>>>> linkend="wal-configuration"/>)
>>>> +      </para></entry>
>>>>
>>>> XLogWrite() can be invoked during the functions other than XLogFlush().
>>>> For example, XLogBackgroundFlush(). So the above description might be
>>>> confusing?
>>>>
>>>> +       Number of times WAL files were synced to disk via
>>>> +       <function>issue_xlog_fsync</function>, which is invoked during an
>>>> +       <function>XLogFlush</function> request (see <xref
>>>> linkend="wal-configuration"/>)
>>>>
>>>> Same as above.
>>>
>>> Yes, why don't you remove "XLogFlush" in the above comments
>>> because XLogWrite() description is covered in wal.sgml?
>>>
>>> But, now it's mentioned only for backend,
>>> I added the comments for the wal writer in the attached patch.
>>>
>>>
>>>> +       while <xref linkend="guc-wal-sync-method"/> was set to one of the
>>>> +       "sync at commit" options (i.e., <literal>fdatasync</literal>,
>>>> +       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
>>>>
>>>> Even open_sync and open_datasync do the sync at commit. No? I'm not sure
>>>> if "sync at commit" is right term to indicate fdatasync, fsync and
>>>> fsync_writethrough.
>>>
>>> Yes, why don't you change to the following comments?
>>>
>>> ```
>>>         while <xref linkend="guc-wal-sync-method"/> was set to one of the
>>>         options which specific fsync method is called (i.e., <literal>fdatasync</literal>,
>>>         <literal>fsync</literal>, or <literal>fsync_writethrough</literal>)
>>> ```
>>>
>>>> +       <literal>open_sync</literal>. Units are in milliseconds with
>>>> microsecond resolution.
>>>>
>>>> "with microsecond resolution" part is really necessary?
>>>
>>> I removed it because blk_read_time in pg_stat_database is the same above,
>>> but it doesn't mention it.
>>>
>>>
>>>> +   transaction records are flushed to permanent storage.
>>>> +   <function>XLogFlush</function> calls <function>XLogWrite</function> to write
>>>> +   and <function>issue_xlog_fsync</function> to flush them, which are
>>>> counted as
>>>> +   <literal>wal_write</literal> and <literal>wal_sync</literal> in
>>>> +   <xref linkend="pg-stat-wal-view"/>. On systems with high log output,
>>>>
>>>> This description might cause users to misread that XLogFlush() calls
>>>> issue_xlog_fsync(). Since issue_xlog_fsync() is called by XLogWrite(),
>>>> ISTM that this description needs to be updated.
>>>
>>> I understood. I fixed to mention that XLogWrite()
>>> calls issue_xlog_fsync().
>>>
>>>
>>>> Each line in the above seems to end with a space character.
>>>> This space character should be removed.
>>>
>>> Sorry for that. I removed it.
>>
>> Thanks for updating the patch! I think it's getting good shape!
>> - pid  | wait_event_type | wait_event
>> + pid  | wait_event_type | wait_event
>>
>> This change is not necessary?
>
> No, sorry.
> I removed it by mistake when I remove trailing space characters.
>
>
>> -   every <xref linkend="guc-wal-writer-delay"/> milliseconds.
>> +   every <xref linkend="guc-wal-writer-delay"/> milliseconds, which calls
>> +   <function>XLogWrite</function> to write and <function>XLogWrite</function>
>> +   <function>issue_xlog_fsync</function> to flush them. They are counted as
>> +   <literal>wal_write</literal> and <literal>wal_sync</literal> in
>> +   <xref linkend="pg-stat-wal-view"/>.
>>
>> Isn't it better to avoid using the terms like XLogWrite or issue_xlog_fsync
>> before explaining what they are? They are explained later. At least for me
>> I'm ok without this change.
>
> OK. I removed them and add a new paragraph.
>
>
>> -   to write (move to kernel cache) a few filled <acronym>WAL</acronym>
>> -   buffers. This is undesirable because <function>XLogInsertRecord</function>
>> +   to call <function>XLogWrite</function> to write (move to kernel cache) a
>> +   few filled <acronym>WAL</acronym> buffers (the tally of this event
>> is reported in
>> +   <literal>wal_buffers_full</literal> in <xref linkend="pg-stat-wal-view"/>).
>> +   This is undesirable because <function>XLogInsertRecord</function>
>>
>> This paragraph explains the relationshp between WAL writes and WAL
>> buffers. I don't think it's good to add different context to this
>> paragraph. Instead, what about adding new paragraph like the follwing?
>>
>> ----------------------------------
>> When track_wal_io_timing is enabled, the total amounts of time
>> XLogWrite writes and issue_xlog_fsync syncs WAL data to disk are
>> counted as wal_write_time and wal_sync_time in pg_stat_wal view,
>> respectively. XLogWrite is normally called by XLogInsertRecord (when
>> there is no space for the new record in WAL buffers), XLogFlush and
>> the WAL writer, to write WAL buffers to disk and call
>> issue_xlog_fsync. If wal_sync_method is either open_datasync or
>> open_sync, a write operation in XLogWrite guarantees to sync written
>> WAL data to disk and issue_xlog_fsync does nothing. If wal_sync_method
>> is either fdatasync, fsync, or fsync_writethrough, the write operation
>> moves WAL buffer to kernel cache and issue_xlog_fsync syncs WAL files
>> to disk. Regardless of the setting of track_wal_io_timing, the numbers
>> of times XLogWrite writes and issue_xlog_fsync syncs WAL data to disk
>> are also counted as wal_write and wal_sync in pg_stat_wal,
>> respectively.
>> ----------------------------------
>
> Thanks, I agree it's better.
>
>
>> +       <function>issue_xlog_fsync</function> (see <xref
>> linkend="wal-configuration"/>)
>>
>> "request" should be place just before "(see"?
>
> Yes, thanks.
>
>
>
>> +       Number of times WAL files were synced to disk via
>> +       <function>issue_xlog_fsync</function> (see <xref
>> linkend="wal-configuration"/>)
>> +       while <xref linkend="guc-wal-sync-method"/> was set to one of the
>> +       options which specific fsync method is called (i.e.,
>> <literal>fdatasync</literal>,
>> +       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
>>
>> Isn't it better to mention the case of fsync=off? What about the following?
>>
>> ----------------------------------
>> Number of times WAL files were synced to disk via issue_xlog_fsync
>> (see ...). This is zero when fsync is off or wal_sync_method is either
>> open_datasync or open_sync.
>> ----------------------------------
>
> Yes.
>
>
>> +       Total amount of time spent writing WAL buffers were written
>> out to disk via
>>
>> "were written out" is not necessary?
>
> Yes, removed it.
>
>> +       Total amount of time spent syncing WAL files to disk via
>> +       <function>issue_xlog_fsync</function> request (see <xref
>> linkend="wal-configuration"/>)
>> +       while <xref linkend="guc-wal-sync-method"/> was set to one of the
>> +       options which specific fsync method is called (i.e.,
>> <literal>fdatasync</literal>,
>> +       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
>> +       Units are in milliseconds.
>> +       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
>>
>> Isn't it better to explain the case where this counter is zero a bit
>> more clearly as follows?
>>
>> ---------------------
>> This is zero when track_wal_io_timing is disabled, fsync is off, or
>> wal_sync_method is either open_datasync or open_sync.
>> ---------------------
>
> Yes, thanks.

Thanks for updating the patch! I applied cosmetic changes to that.
Patch attached. Barring any objection, I will commit this version.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment Content-Type Size
v16-0001-Add-statistics-related-to-write-sync-wal-records_fujii.patch text/plain 21.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2021-03-08 15:54:18 Re: [PATCH] pgbench: improve \sleep meta command
Previous Message Mark Dilger 2021-03-08 15:39:44 Re: proposal: psql –help reflecting service or URI usage