From: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
---|---|
To: | amit(dot)kapila16(at)gmail(dot)com |
Cc: | abbas(dot)butt(at)enterprisedb(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, zahid(dot)iqbal(at)enterprisedb(dot)com |
Subject: | Re: Logical replication keepalive flood |
Date: | 2021-06-11 01:37:47 |
Message-ID: | 20210611.103747.1303885450661824696.horikyota.ntt@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
At Thu, 10 Jun 2021 12:18:00 +0530, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote in
> Good analysis. I think this analysis has shown that walsender is
> sending messages at top speed as soon as they are generated. So, I am
> wondering why there is any need to wait/sleep in such a workload. One
> possibility that occurred to me RecentFlushPtr is not updated and or
> we are not checking it aggressively. To investigate on that lines, can
> you check the behavior with the attached patch? This is just a quick
> hack patch to test whether we need to really wait for WAL a bit
> aggressively.
Yeah, anyway the comment for the caller site of WalSndKeepalive tells
that exiting out of the function *after* there is somewhat wrong.
> * possibly are waiting for a later location. So, before sleeping, we
> * send a ping containing the flush location. If the receiver is
But I nothing changed by moving the keepalive check to after the exit
check. (loc <= RecentFlushPtr).
And the patch also doesn't change the situation so much. The average
number of loops is reduced from 3 to 2 per call but the ratio between
total records and keepalives doesn't change.
previsous: A=#total-rec = 19476, B=#keepalive=3006, B/A = 0.154
this time: A=#total-rec = 13208, B=#keepalive=1988, B/A = 0.151
Total records: 13208
reqsz: #sent/ #!sent/ #call: wr lag / fl lag
8: 4 / 4 / 4: 6448 / 268148
16: 1 / 1 / 1: 8688 / 387320
24: 1988 / 1987 / 1999: 6357 / 226163
195: 1 / 0 / 20: 408 / 1647
7477: 2 / 0 / 244: 68 / 847
8225: 1 / 1 / 1: 7208 / 7208
So I checked how many bytes RecentFlushPtr is behind requested loc if
it is not advanced enough.
Total records: 15128
reqsz: #sent/ #!sent/ #call: wr lag / fl lag / RecentFlushPtr lag
8: 2 / 2 / 2: 520 / 60640 / 8
16: 1 / 1 / 1: 8664 / 89336 / 16
24: 2290 / 2274 / 2302: 5677 / 230583 / 23
187: 1 / 0 / 40: 1 / 6118 / 1
7577: 1 / 0 / 69: 120 / 3733 / 65
8177: 1 / 1 / 1: 8288 / 8288 / 2673
So it's not a matter of RecentFlushPtr check. (Almost) Always when
WalSndWakeupRequest feels to need to send a keepalive, the function is
called before the record begins to be written.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From | Date | Subject | |
---|---|---|---|
Next Message | Noah Misch | 2021-06-11 01:39:42 | Re: BF assertion failure on mandrill in walsender, v13 |
Previous Message | osumi.takamichi@fujitsu.com | 2021-06-11 01:27:40 | RE: locking [user] catalog tables vs 2pc vs logical rep |