Re: Resetting spilled txn statistics in pg_stat_replication

From: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Resetting spilled txn statistics in pg_stat_replication
Date: 2020-10-13 06:46:35
Message-ID: CA+fd4k6Yn9N=+6brYZmUM4Q0Er4ymo9hu8ueyotE_hN4gmHstA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 13 Oct 2020 at 15:27, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Oct 13, 2020 at 11:49 AM Masahiko Sawada
> <masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
> >
> > On Tue, 13 Oct 2020 at 14:53, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Oct 13, 2020 at 11:05 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > > >
> > > > Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> writes:
> > > > >> It is possible that MAXALIGN stuff is playing a role here and or the
> > > > >> background transaction stuff. I think if we go with the idea of
> > > > >> testing spill_txns and spill_count being positive then the results
> > > > >> will be stable. I'll write a patch for that.
> > > >
> > > > Here's our first failure on a MAXALIGN-8 machine:
> > > >
> > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2020-10-13%2005%3A00%3A08
> > > >
> > > > So this is just plain not stable. It is odd though. I can
> > > > easily think of mechanisms that would cause the WAL volume
> > > > to occasionally be *more* than the "typical" case. What
> > > > would cause it to be *less*, if MAXALIGN is ruled out?
> > > >
> > >
> > > The original theory I have given above [1] which is an interleaved
> > > autovacumm transaction. Let me try to explain in a bit more detail.
> > > Say when transaction T-1 is performing Insert ('INSERT INTO stats_test
> > > SELECT 'serialize-topbig--1:'||g.i FROM generate_series(1, 5000)
> > > g(i);') a parallel autovacuum transaction occurs. The problem as seen
> > > in buildfarm will happen when autovacuum transaction happens after 80%
> > > or more of the Insert is done.
> > >
> > > In such a situation we will start decoding 'Insert' first and need to
> > > spill multiple times due to the amount of changes (more than threshold
> > > logical_decoding_work_mem) and then before we encounter Commit of
> > > transaction that performed Insert (and probably some more changes from
> > > that transaction) we will encounter a small transaction (autovacuum
> > > transaction). The decode of that small transaction will send the
> > > stats collected till now which will lead to the problem shown in
> > > buildfarm.
> >
> > That seems a possible scenario.
> >
> > I think probably this also explains the reason why spill_count
> > slightly varied and spill_txns was still 1. The spill_count value
> > depends on how much the process spilled out transactions before
> > encountering the commit of an autovacuum transaction. Since we have
> > the spill statistics per reorder buffer, not per transactions, it's
> > possible.
> >
>
> Okay, here is an updated version (changed some comments) of the patch
> I posted some time back. What do you think? I have tested this on both
> Windows and Linux environments. I think it is a bit tricky to
> reproduce the exact scenario so if you are fine we can push this and
> check or let me know if you any better idea?

I agree to check if the spill_counts and spill_txns are positive. I
thought we can reduce the number of tuples to insert to the half. It
would help to reduce the likelihood of other transactions interfere
and speed up the test (currently, the stats.sql test takes almost 1
sec in my environment). But it might lead to another problem like the
logical decoding doesn't spill out the transaction on some
environment.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2020-10-13 07:12:35 Re: Resetting spilled txn statistics in pg_stat_replication
Previous Message Amit Kapila 2020-10-13 06:27:56 Re: Resetting spilled txn statistics in pg_stat_replication