Re: Monitoring logical replication

From: Shaheed Haque <shaheedhaque(at)gmail(dot)com>
To: pgsql-general list <pgsql-general(at)lists(dot)postgresql(dot)org>
Cc: Klaus Darilion <klaus(dot)mailinglists(at)pernau(dot)at>
Subject: Re: Monitoring logical replication
Date: 2024-06-18 09:03:19
Message-ID: CAHAc2jccg964PHjxkmyD6XgknS2umxGsuNtVULo2P76N+uvvHQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi all,

Is there an "official" pairing of LSN values on the publication and
subscription sides that should be used to track the delta between the two
systems? I ask because Google is full of different pairs being used. I
tried to identify the highest level interface points exposed, i.e. what is
documented on
https://www.postgresql.org/docs/current/replication-origins.html, the
pg_stat_subscription table, the pg_stat_publication table and the
pg_current_wal_lsn() function on the publisher, but these seem to be barely
used.

Am I missing something?

Thanks, Shaheed

P.S. On a related note, I see a (stalled?) discussion on providing LSN ->
timestamp conversion
<https://www.postgresql.org/message-id/flat/CAAKRu_bw7Pgw8Mi9LJrBkFvPPHgvVjPphrT8ugbzs-2V0f%2B1Rw%40mail.gmail.com#8540282228634ecd061585867c6275ca>,
I'd just like to say that something like that would be very useful.

On Tue, 30 Jan 2024 at 11:27, Shaheed Haque <shaheedhaque(at)gmail(dot)com> wrote:

> This is great, thank you for posting. I'm currently a subcontinent or two
> away from my dev env, but will compare your approach with mine (you are
> using some facilities of psql I'm not familiar with). At least you have
> confirmed that LSNs are the place to start.
>
> Thanks again, Shaheed
>
>
> On Tue, 30 Jan 2024, 05:15 Klaus Darilion, <klaus(dot)mailinglists(at)pernau(dot)at>
> wrote:
>
>> Hi Saheed!
>>
>> I monitor our replication this way:
>>
>> 1. Every 10 seconds i fetch the current LSN and write it into a table,
>> next with the current timestamp. Further I fetch confirmend LSNs from
>> the replication slots and delete old entries in lsn2data table.
>>
>> calculate_logical_replication_lag.php:
>>
>> <?php
>>
>> $path = realpath(dirname(__FILE__) . "/../inc");
>> set_include_path($path . PATH_SEPARATOR . get_include_path());
>>
>> require_once('config.php');
>> $config_int['syslogprefix'] = basename(__FILE__);
>> require_once('logging.php');
>>
>> $dbuser="replication_lag_user";
>> $dbpass="XXXXXXXXXXXXXXXXXXXX";
>> if (!$dbconn = pg_pconnect('host='.$config_int['dbhost'].'
>> dbname='.$config_int['dbname'].' user='.$dbuser.' password='.$dbpass)) {
>> print "Sorry, database connection failed";
>> exit;
>> }
>>
>> $accuracy = 10; // in seconds
>>
>> //
>> // Preparations:
>> //
>> // CREATE TABLE lsn2data(
>> // lsn pg_lsn PRIMARY KEY,
>> // seen timestamp NOT NULL DEFAULT NOW()
>> // );
>> // CREATE ROLE replication_lag_user WITH LOGIN PASSWORD
>> 'XXXXXXXXXXXXXXXXXXX';
>> // GRANT ALL ON TABLE lsn2data TO replication_lag_user;
>> //
>> // CREATE OR REPLACE FUNCTION get_replication_lag() RETURNS TABLE
>> (subscriber name, lag bigint) AS
>> // $BODY$
>> // DECLARE
>> // subscriber name;
>> // BEGIN
>> // FOR subscriber IN
>> // SELECT slot_name FROM pg_replication_slots
>> // LOOP
>> // RETURN QUERY SELECT slot_name, EXTRACT(EPOCH FROM
>> NOW()-seen)::bigint lag from lsn2data,pg_replication_slots WHERE
>> slot_name=subscriber AND lsn < confirmed_flush_lsn ORDER BY lsn DESC
>> LIMIT 1;
>> // END LOOP;
>> // RETURN;
>> // END
>> // $BODY$
>> // LANGUAGE plpgsql;
>> //
>> while (1) {
>> $dbq = pg_query("INSERT INTO lsn2data (lsn) VALUES
>> (pg_current_wal_lsn())");
>> if ($dbq === FALSE) {
>> mylog(LOG_ERROR, "SQL query error:
>> ".pg_last_error()."\n");
>> exit(1);
>> }
>>
>> $dbq = pg_query("DELETE FROM lsn2data WHERE lsn < (".
>> "SELECT lsn FROM lsn2data WHERE lsn < (".
>> "SELECT confirmed_flush_lsn FROM
>> pg_replication_slots ORDER BY confirmed_flush_lsn ASC LIMIT 1".
>> ") ORDER BY lsn DESC LIMIT 1".
>> ")"
>> );
>> if ($dbq === FALSE) {
>> mylog(LOG_ERROR, "SQL query error:
>> ".pg_last_error()."\n");
>> exit(1);
>> }
>> sleep($accuracy);
>> }
>>
>> 2. I graph the replications lags (converted from LSN to seconds) in my
>> check_mk monitoring:
>>
>> #!/bin/bash
>>
>> #
>> # Managed by Puppet:
>> modules/base/files/monitoring/check_logical_replication_lag.sh
>> #
>> # Check the logical replication lag and export performance data for each
>> subscriber
>> #
>>
>> # exit on error
>> #set -e
>>
>> #Make sure this script only runs one at a time
>> (
>>
>> ME=$0
>> MEBASE=`basename $0`
>>
>> mylog () {
>> echo "$MEBASE: $1"
>> logger -t "$MEBASE" "$1"
>> }
>>
>> flock -x -w 1 200
>> if [ $? != "0" ]; then
>> #echo "ERROR: $0 is already running ... exit"
>> logger -t "$MEBASE" "ERROR: $0 is already running ... exit"
>> exit 1
>> fi
>>
>> # Do stuff
>>
>> # Variablen fuer Monitoring
>> CMK_SPOOLDIR=/var/lib/check_mk_agent/spool
>> CMK_NAME=$MEBASE
>> CMK_SPOOLFILE=600_`basename ${CMK_NAME}`.txt
>> CMK_HEADER="<<<local>>>"
>> TMP_FILE="/tmp/logical_replication_lag.csv"
>>
>> # Schwellwerte
>> warn=300
>> crit=600
>>
>> final_output="$CMK_HEADER\nP $CMK_NAME "
>>
>> # move to a directory where user postgresl may reside (sudo)
>> cd /tmp
>>
>> # Lag auslesen. Waehrend dem initialen aufsynchen eines Subscribers gibt
>> es temporaere Subscriptions, mit dem Namen reg_xxx1-pid-sync-pid.
>> # Damit diese nicht getrackt werden gibt es die huebsche LIKE Clause.
>> rm -f "$TMP_FILE"
>> sudo -u postgres psql regdns -c "COPY (SELECT subscriber,lag FROM
>> get_replication_lag() WHERE subscriber LIKE '%\_____' ORDER BY 2 DESC)
>> TO '$TMP_FILE' With CSV" 2>&1> /dev/null
>> LC=$(sudo -u postgres psql -t regdns -c "SELECT count(*) FROM
>> get_replication_lag();" | tr -d ' ')
>>
>> if [ $LC == "0" ]; then
>> echo -e "$CMK_HEADER\n0 $CMK_NAME - No Slaves with Replication
>> found - maybe we are a slave?" > $CMK_SPOOLDIR/$CMK_SPOOLFILE
>> exit 0;
>> fi
>>
>> grep $(hostname | cut -d '-' -f2) "$TMP_FILE" > /dev/null
>> if [ $? != "0" ]; then
>> echo -e "$CMK_HEADER\n2 $CMK_NAME - Postgres Output does not
>> seem valid. Please check script $ME and output in $TMP_FILE" >
>> $CMK_SPOOLDIR/$CMK_SPOOLFILE
>> exit 1;
>> fi
>>
>> # CSV in Array einlesen
>> IFS=$'\n' read -d '' -r -a input_file < "$TMP_FILE"
>>
>> # Auswerten
>> maxlag=0
>> for i in "${input_file[(at)]}"; do
>> node=`echo $i | awk -F "," '{print $1}' | tr -- _ -`
>> lag=`echo $i | awk -F "," '{print $2}'`
>> final_output="$final_output$node=$lag;$warn;$crit|"
>> #
>>
>> https://unix.stackexchange.com/questions/186663/is-there-a-unix-command-that-gives-the-minimum-maximum-of-two-numbers
>> maxlag=$(( maxlag > lag ? maxlag : lag ))
>> done
>> final_output="${final_output}max-lag=$maxlag;$warn;$crit"
>>
>> # Letztes Pipe Zeichen rausschneiden
>> #final_output=`echo $final_output | rev | cut -c 2- | rev`
>>
>> # Spool File schreiben
>> echo -e $final_output > $CMK_SPOOLDIR/$CMK_SPOOLFILE
>> logger -t "$MEBASE" "$final_output"
>>
>>
>> ) 200>/tmp/`basename $0`.exclusivelock
>>
>>
>> 3. During initial sync I check the status on the subscriber. Once it has
>> synced all tables of the publication, it will send me an email.
>> #
>> # Managed by Puppet:
>> modules/pdns/templates/check_pglogical_subscription.sh.erb
>> #
>>
>> #
>> # This script checks and eventually creates the subscription, and wait
>> until the initial sync is finished
>> #
>>
>> PUB=regdns2020_pub
>> SLEEP=5
>> PREFIX=check_pglogical_subscription.sh
>> NUMTAB=175
>>
>> SECONDS=0
>> date
>> while true; do
>> echo "SELECT * from pg_subscription;" | sudo -u postgres psql -t
>> regdns | grep -q $PUB
>> if [ $? -eq 0 ]; then
>> echo "OK: Host is subscribed to '$PUB'. Checking for
>> table count ..."
>> break
>> fi
>> echo "ERROR: Host is not subscribed to '$PUB'. Subscribing to
>> master ..."
>> logger -t $PREFIX "ERROR: Host is not subscribed to '$PUB'.
>> Subscribing to master ..."
>> echo "CREATE SUBSCRIPTION `hostname -s|tr -- - _` CONNECTION
>> 'host=XXXXX dbname=XXXX user=XXXXX password=XXXXXX PUBLICATION
>> regdns2020_pub;" | sudo -u postgres psql regdns && touch
>> /etc/regdns.schema_subscription.created
>> echo "Re-Checking in $SLEEP seconds ..."
>> logger -t $PREFIX "Re-Checking in $SLEEP seconds ..."
>> sleep $SLEEP
>> done
>>
>> while true; do
>> COUNT=$(echo "SELECT count(*) from pg_subscription_rel;" | sudo
>> -u postgres psql -t regdns | head -1 | xargs)
>> if [ $COUNT -eq $NUMTAB ]; then
>> echo "OK: Subscription '$PUB' contains $NUMTAB tables -
>> that is OK. Checking for initial-sync status ..."
>> logger -t $PREFIX "OK: Subscription '$PUB' contains
>> $NUMTAB tables - that is OK. Checking for initial-sync status ..."
>> break
>> fi
>> echo "ERROR: Subscription '$PUB' contains $COUNT tables, but
>> should contain $NUMTAB table. Re-Checking in $SLEEP seconds ..."
>> logger -t $PREFIX "ERROR: Subscription '$PUB' contains $COUNT
>> tables, but should contain $NUMTAB table. Re-Checking in $SLEEP seconds
>> ..."
>> sleep $SLEEP
>> done
>>
>> while true; do
>> COUNTFIN=$(echo "SELECT count(*) from pg_subscription_rel WHERE
>> srsubstate='r';" | sudo -u postgres psql -t regdns | head -1 | xargs)
>> if [ $COUNTFIN -eq $NUMTAB ]; then
>> echo "OK: Initial sync of $COUNTFIN/$NUMTAB tables
>> finished in $SECONDS seconds."
>> logger -t $PREFIX "OK: Initial sync of $COUNTFIN/$NUMTAB
>> tables finished in $SECONDS seconds."
>> echo "OK: Initial sync of $COUNTFIN/$NUMTAB tables
>> finished in $SECONDS seconds." | mailx -s "$HOST $SECONDS seconds to
>> subscribe" -- root
>> break
>> fi
>> echo "PROGRESS: Initial sync of $COUNTFIN/$NUMTAB tables
>> finished. Re-Checking in $SLEEP seconds ..."
>> logger -t $PREFIX "PROGRESS: Initial sync of $COUNTFIN/$NUMTAB
>> tables finished. Re-Checking in $SLEEP seconds ..."
>> sleep $SLEEP
>> done
>>
>>
>>
>> regards
>> Klaus
>>
>>
>>
>>
>> Am 2023-10-07 17:31, schrieb Shaheed Haque:
>> > Hi,
>> >
>> > I've been playing with logical replication (currently on PG14),
>> > specifically in an AWS RDS Postgres context, but NOT using AWS' own
>> > replication tooling. I'm generally familiar with the challenges of
>> > distributed systems (such causality, time synchronisation etc), but
>> > not especially familiar with PG.
>> >
>> > In looking at how to tell how a given subscriber has caught up with
>> > its publisher, there is plenty of advice around the Web, for example
>> >
>> https://dba.stackexchange.com/questions/314324/monitor-logical-replication-using-lsn
>> .
>> > Like this example, much advice ends up talking about using separate
>> > queries on the publisher and the subscriber to compare LSNs. First, (I
>> > think) I understand the core difficulty that comparing LSNs is
>> > inherently racy, but given that, I'm a bit unclear as to why a single
>> > query on the publisher is not enough...IIUC:
>> >
>> > * Changes sent from the publisher to the subscriber are
>> identified by
>> > LSN.
>> > * The publisher knows it's own current latest LSN
>> > (pg_current_wal_lsn()), but this seems not to be exposed at the
>> > subscriber.
>> > * The subscriber knows what it has applied locally and even tells
>> the
>> > publisher (pg_stat_subscription.latest_end_lsn), but it does not seem
>> > to be exposed at the publisher.
>> >
>> > Have I missed something? Is there a way to track the LSN delta (given
>> > that this is known to be racy) just by querying one end?
>> >
>> > Second, how do folk "know" when replication is "done". For example, if
>> > the two LSNs continued to match for 1 * replication lag? Or N *
>> > replication lag? What would be a plausible N?
>> >
>> > Third, as we know when logical replication is started, the initial
>> > table state is captured in a snapshot, and sent across using COPY
>> > TABLE under the covers. Now, let's say that the publisher is idle
>> > (i.e. no SQL writes to the user's schema...obviously pg_catalog might
>> > change as replication is configured and enabled) and that the
>> > replication starts with the publisher as LSN_start. How could one know
>> > when the copying is done:
>> >
>> > * I initially assumed that the publisher's LSN would not change
>> from
>> > LSN_start, but as the copying proceeds, I see that it DOES change
>> > (presumably because there are updates happening to pg_catalog, such as
>> > the temporary slots coming and going).
>> > * Is there some kind of singleton state on either publisher or
>> > subscriber that could be checked to know? (At the moment, I am
>> > counting the records in all copied tables).
>> >
>> > I realise that the knowledge that the publisher is "idle" is a special
>> > case, but right now, my test for being "done" is:
>> >
>> > * Number of records in copied tables matches AND the publisher's
>> > pg_stat_subscription matches the subscriber's
>> > pg_stat_subscription.latest_end_lsn.
>> >
>> > Plus or minus the bit about replication lag, is there a better way?
>> >
>> > Thanks, Shaheed
>>
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Alvaro Herrera 2024-06-18 09:46:27 Re: How to attach partition with primary key
Previous Message Dominique Devienne 2024-06-18 08:35:21 set search_path "$owner". And name versus literal for schemas.