Re: Replication slot stats misgivings

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: Replication slot stats misgivings
Date: 2021-04-10 01:54:02
Message-ID: CAD21AoBn9RzmrgsOQ0RVZ2=ON99OeKSZYTj+eUp-HjrMJw8-tA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Apr 8, 2021 at 7:30 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Apr 7, 2021 at 2:51 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
>
> @@ -4069,6 +4069,24 @@ pgstat_read_statsfiles(Oid onlydb, bool
> permanent, bool deep)
> * slot follows.
> */
> case 'R':
> + /*
> + * There is a remote scenario where one of the replication slots
> + * is dropped and the drop slot statistics message is not
> + * received by the statistic collector process, now if the
> + * max_replication_slots is reduced to the actual number of
> + * replication slots that are in use and the server is
> + * re-started then the statistics process will not be aware of
> + * this. To avoid writing beyond the max_replication_slots
> + * this replication slot statistic information will be skipped.
> + */
> + if (max_replication_slots == nReplSlotStats)
> + {
> + ereport(pgStatRunningInCollector ? LOG : WARNING,
> + (errmsg("skipping \"%s\" replication slot statistics as
> pg_stat_replication_slots does not have enough slots",
> + NameStr(replSlotStats[nReplSlotStats].slotname))));
> + goto done;
> + }
>
> I think we might truncate some valid slots here. I have another idea
> to fix this case which is that while writing, we first write the
> 'nReplSlotStats' and then write each slot info. Then while reading we
> can allocate memory based on the required number of slots. Later when
> startup process sends the slots, we can remove the already dropped
> slots from this array. What do you think?

IIUC there are two problems in the case where the drop message is lost:

1. Writing beyond the end of replSlotStats.
This can happen if after restarting the number of slots whose stats
are stored in the stats file exceeds max_replication_slots. Vignesh's
patch addresses this problem.

2. The stats for the new slot are not recorded.
If the stats for already-dropped slots remain in replSlotStats, the
stats for the new slot cannot be registered due to the full of
replSlotStats. This can happen even when after restarting the number
of slots whose stats are stored in the stat file does NOT exceed
max_replication_slots as well as even during the server running. The
patch doesn’t address this problem. (If this happens, we will have to
reset all slot stats since pg_stat_reset_replication_slot() cannot
remove the slot stats with the non-existing name).

I think we can use HTAB to store slot stats and have
pg_stat_get_replication_slot() inquire about stats by the slot name,
resolving both problems. By using HTAB we're no longer concerned about
the problem of writing stats beyond the end of the replSlotStats
array. Instead, we have to consider how and when to clean up the stats
for already-dropped slots. We can have the startup process send slot
names at startup time, which borrows the idea proposed by Amit. But
maybe we need to consider the case again where the message from the
startup process is lost? Another idea would be to have
pgstat_vacuum_stat() check the existing slots and call
pgstat_report_replslot_drop() if the slot in the stats file doesn't
exist. That way, we can continuously check the stats for
already-dropped slots.

I've written a PoC patch for the above idea; using HTAB and cleaning
up slot stats at pgstat_vacuum_stat(). The patch can be applied on top
of 0001 patch Vignesh proposed before[1].

Please note that this cannot resolve the problem of ending up
accumulating the stats to the old slot if the slot is re-created with
the same name and the drop message is lost. To deal with this problem
I think we would need to use something unique identifier for each slot
instead of slot name.

[1] https://www.postgresql.org/message-id/CALDaNm195xL1bZq4VHKt%3D-wmXJ5kC4jxKh7LXK%2BpN7ESFjHO%2Bw%40mail.gmail.com

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachment Content-Type Size
0001-POC-Use-HTAB-for-replication-slot-stats.patch application/octet-stream 32.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2021-04-10 02:32:35 Re: Avoid unnecessary table open/close for TRUNCATE foo, foo, foo; kind of commands
Previous Message Justin Pryzby 2021-04-10 01:39:19 Re: Table refer leak in logical replication