Quick Links

Re: Show WAL write and fsync stats in pg_stat_io

From:	Melanie Plageman <melanieplageman(at)gmail(dot)com>
To:	Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "bharath(dot)rupireddyforpostgres(at)gmail(dot)com" <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject:	Re: Show WAL write and fsync stats in pg_stat_io
Date:	2024-01-11 14:27:53
Message-ID:	CAAKRu_bU2-p77kGsed=V_C=hxKqREy=Wyx=zEwgwBvriwjGCTA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, Jan 11, 2024 at 6:19 AM Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> wrote:
>
> On Thu, 11 Jan 2024 at 08:01, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> >
> > On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:
> > > On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> wrote:
> > >> On Sun, 31 Dec 2023 at 03:58, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > >> Oh, I understand it now. Yes, that makes sense.
> > >> I thought removing op_bytes completely ( as you said "This patch
> > >> extends it with two more operation sizes, and there are even cases
> > >> where it may be a variable" ) from pg_stat_io view then adding
> > >> something like {read | write | extend}_bytes and {read | write |
> > >> extend}_calls could be better, so that we don't lose any information.
> > >
> > > Upthread, Michael says:
> > >
> > >> I find the use of 1 in this context a bit confusing, because when
> > >> referring to a counter at N, then it can be understood as doing N
> > >> times a operation,
> > >
> > > I didn't understand this argument, so I'm not sure if I agree or
> > > disagree with it.
> >
> > Nazir has mentioned upthread one thing: what should we do for the case
> > where a combination of (io_object,io_context) does I/O with a
> > *variable* op_bytes, because that may be the case for the WAL
> > receiver? For this case, he has mentioned that we should set op_bytes
> > to 1, but that's something I find confusing because it would mean that
> > we are doing read, writes or extends 1 byte at a time. My suggestion
> > would be to use op_bytes = -1 or NULL for the variable case instead,
> > with reads, writes and extends referring to a number of bytes rather
> > than a number of operations.
>
> I agree but we can't do this only for the *variable* cases since
> B_WAL_RECEIVER and other backends use the same
> pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean
> is, if two backends use the same pgstat_count_io_op_time() function
> call in the code; they must count the same thing (number of calls,
> bytes, etc.). It could be better to count the number of bytes for all
> the IOOBJECT_WAL IOs.

I'm a bit confused by this. pgstat_count_io_op_time() can check
MyBackendType. In fact, you do this to ban the wal receiver already.
It is true that you would need to count all wal receiver normal
context wal object IOOps in the variable way, but I don't see how
pgstat_count_io_op_time() is the limiting factor as long as the
callsite is always doing either the number of bytes or the number of
calls.

> > > I think these are the three proposals for handling WAL reads:
> > >
> > > 1) setting op_bytes to 1 and the number of reads is the number of bytes
> > > 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
> > > number of calls to pg_pread() or similar
> > > 3) setting op_bytes to NULL and the number of reads is the number of
> > > calls to pg_pread() or similar
> >
> > 3) could be a number of bytes, actually.
>
> One important point is that we can't change only reads, if we decide
> to count the number of bytes for the reads; writes and extends should
> be counted as a number of bytes as well.

Yes, that is true.

> > > Looking at the patch, I think it is still doing 2.
> >
> > The patch disables stats for the WAL receiver, while the startup
> > process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
> > discard the variable case.
> >
> > > For an unpopular idea: we could add separate [IOOp]_bytes columns for
> > > all those IOOps for which it would be relevant. It kind of stinks but
> > > it would give us the freedom to document exactly what a single IOOp
> > > means for each combination of BackendType, IOContext, IOObject, and
> > > IOOp (as relevant) and still have an accurate number in the *bytes
> > > columns. Everyone will probably hate us if we do that, though.
> > > Especially because having bytes for the existing IOObjects is an
> > > existing feature.
> >
> > An issue I have with this one is that having multiple tuples for
> > each (object,context) if they have multiple op_bytes leads to
> > potentially a lot of bloat in the view. That would be up to 8k extra
> > tuples just for the sake of op_byte's variability.
>
> Yes, that doesn't seem applicable to me.

My suggestion (again not sure it is a good idea) was actually that we
remove op_bytes and add "write_bytes", "read_bytes", and
"extend_bytes". AFAICT, this would add columns not rows. In this
schema, read bytes for wal receiver could be counted in one way and
writes in another. We could document that, for wal receiver, the reads
are not always done in units of the same size, so the read_bytes /
reads could be thought of as an average size of read.

Even if we made a separate view for WAL I/O stats, we would still have
this issue of variable sized I/O vs block sized I/O and would probably
end up solving it with separate columns for the number of bytes and
number of operations.

> > > A separate question: suppose [1] goes in (to read WAL from WAL buffers
> > > directly). Now, WAL reads are not from permanent storage anymore. Are
> > > we only tracking permanent storage I/O in pg_stat_io? I also had this
> > > question for some of the WAL receiver functions. Should we track any
> > > I/O other than permanent storage I/O? Or did I miss this being
> > > addressed upthread?
> >
> > That's a good point. I guess that this should just be a different
> > IOOp? That's not a IOOP_READ. A IOOP_HIT is also different.
>
> I think different IOContext rather than IOOp suits better for this.

That makes sense to me.

> > > In terms of what I/O we should track in a streaming/asynchronous
> > > world, the options would be:
> > >
> > > 1) track read/write syscalls
> > > 2) track blocks of BLCKSZ submitted to the kernel
> > > 3) track bytes submitted to the kernel
> > > 4) track merged I/Os (after doing any merging in the application)
> > >
> > > I think the debate was largely between 2 and 4. There was some
> > > disagreement, but I think we landed on 2 because there is merging that
> > > can happen at many levels in the storage stack (even the storage
> > > controller). Distinguishing between whether or not Postgres submitted
> > > 2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
> > > but I think it might be confusing for the Postgres user trying to
> > > determine why their query is slow. It probably makes the most sense to
> > > still track in block size.
> > >
> > > No matter what solution we pick, you should get a correct number if
> > > you multiply op_bytes by an IOOp (assuming nothing is NULL). Or,
> > > rather, there should be some way of getting an accurate number in
> > > bytes of the amount of a particular kind of I/O that has been done.
> >
> > Yeah, coming back to op_bytes = -1/NULL as a tweak to mean that reads,
> > writes or extends are counted as bytes, because we don't have a fixed
> > operation size for some (object,context) cases.
>
> Can't we use 2 and 3 together? For example, use 3 for the IOOBJECT_WAL
> IOs and 2 for the other IOs.

We can do this. One concern I have is that much of WAL I/O is done in
XLOG_BLCKSZ, so it feels kind of odd for all WAL I/O to appear as if
it is being done in random chunks of bytes. We anticipated other
uniformly non-block-based I/O types where having 1 in op_bytes would
be meaningful. I didn't realize at the time that there would be
variable-sized and block-sized I/O mixed together for the same backend
type, io object, and io context.

- Melanie

In response to

Re: Show WAL write and fsync stats in pg_stat_io at 2024-01-11 11:18:54 from Nazir Bilal Yavuz

Responses

Re: Show WAL write and fsync stats in pg_stat_io at 2024-01-12 13:23:26 from Nazir Bilal Yavuz

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	vignesh C	2024-01-11 14:30:27	Re: Issue in postgres_fdw causing unnecessary wait for cancel request reply
Previous Message	vignesh C	2024-01-11 14:25:52	Re: Assertion failure in SnapBuildInitialSnapshot()