Re: Resetting spilled txn statistics in pg_stat_replication

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Resetting spilled txn statistics in pg_stat_replication
Date: 2020-06-12 12:41:42
Message-ID: CABUevEwtv=a-yQ3LmvGZ1xmnSzDPsfMUTf048HfCXsdNgKnMDg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jun 12, 2020 at 10:23 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

> On Fri, Jun 12, 2020 at 11:20 AM Masahiko Sawada
> <masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
> >
> > On Fri, 12 Jun 2020 at 12:21, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> > >
> > > > Since the logical decoding intermediate files are written at per
> slots
> > > > directory, I thought that corresponding these statistics to
> > > > replication slots is also understandable for users.
> > > >
> > >
> > > What I wanted to know is how will it help users to tune
> > > logical_decoding_work_mem? Different backends can process from the
> > > same slot, so it is not clear how user will be able to make any
> > > meaning out of those stats.
> >
> > I thought that the user needs to constantly monitor them during one
> > process is executing logical decoding and to see the increments. I
> > might not fully understand but I guess the same is true for displaying
> > them w.r.t. process. Since a process can do logical decoding several
> > times using the same slot with a different setting, the user will need
> > to monitor them several times.
> >
>
> Yeah, I think we might not be able to get exact measure but if we
> divide total_size spilled by exec_count, we will get some rough idea
> of what should be the logical_decoding_work_mem for that particular
> session. For ex. consider the logical_decoding_work_mem is 100bytes
> for a particular backend and the size spilled by that backend is 100
> then I think you can roughly keep it to 200bytes if you want to avoid
> spilling. Similarly one can compute its average value over multiple
> executions. Does this make sense to you?
>

The thing that becomes really interesting is to analyze this across time.
For example to identify patterns where it always spills at the same time as
certain other things are happening. For that usecase, having a "predictable
persistence" is important. You may not be able to afford setting
logical_decoding_work_mem high enough to cover every possible scenario (if
you did, then we would basically not need the spilling..), so you want to
track down in relation to the rest of your application exactly when and how
this is happening.

>
> > > OTOH, it is easier to see how to make
> > > meaning of these stats if we display them w.r.t process. Basically,
> > > we have spill_count and spill_size which can be used to tune
> > > logical_decoding_work_mem and also the activity of spilling happens at
> > > process level, so it sounds like one-to-one mapping.
> >
> > Displaying them w.r.t process also seems a good idea but I'm still
> > unclear what to display and how long these values are valid.
> >
>
> I feel till the lifetime of a process if we want to display the values
> at process level but I am open to hear others (including yours) views
> on this.
>

The problem with "lifetime of a process" is that it's not predictable. A
replication process might "bounce" for any reason, and it is normally not a
problem. But if you suddenly lose your stats when you do that, it starts to
matter a lot more. Especially when you don't know if it bounced. (Sure you
can look at the backend_start time, but that adds a whole different sets of
complexitites).

> > The view

> > will have the following columns for example?
> >
> > * pid
> > * slot_name
> > * spill_txns
> > * spill_count
> > * spill_bytes
> > * exec_count
> >
>
> Yeah, these appear to be what I have in mind. Note that we can have
> multiple entries of the same pid here because of slotname, there is
> some value to display slotname but I am not completely sure if that is
> a good idea but I am fine if you have a reason to include slotname?
>

Well, it's a general view so you can always GROUP BY that away if you want
at reading point?

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Gierth 2020-06-12 12:47:19 Re: Infinities in type numeric
Previous Message Ashutosh Bapat 2020-06-12 12:22:28 Re: Getting ERROR with FOR UPDATE/SHARE for partitioned table.