Re: Monitoring logical replication

From: Klaus Darilion <klaus(dot)mailinglists(at)pernau(dot)at>
To: Shaheed Haque <shaheedhaque(at)gmail(dot)com>
Cc: pgsql-general list <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Monitoring logical replication
Date: 2024-01-29 23:15:06
Message-ID: fd211d7c53452d0c8ae2168039e6d55e@pernau.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi Saheed!

I monitor our replication this way:

1. Every 10 seconds i fetch the current LSN and write it into a table,
next with the current timestamp. Further I fetch confirmend LSNs from
the replication slots and delete old entries in lsn2data table.

calculate_logical_replication_lag.php:

<?php

$path = realpath(dirname(__FILE__) . "/../inc");
set_include_path($path . PATH_SEPARATOR . get_include_path());

require_once('config.php');
$config_int['syslogprefix'] = basename(__FILE__);
require_once('logging.php');

$dbuser="replication_lag_user";
$dbpass="XXXXXXXXXXXXXXXXXXXX";
if (!$dbconn = pg_pconnect('host='.$config_int['dbhost'].'
dbname='.$config_int['dbname'].' user='.$dbuser.' password='.$dbpass)) {
print "Sorry, database connection failed";
exit;
}

$accuracy = 10; // in seconds

//
// Preparations:
//
// CREATE TABLE lsn2data(
// lsn pg_lsn PRIMARY KEY,
// seen timestamp NOT NULL DEFAULT NOW()
// );
// CREATE ROLE replication_lag_user WITH LOGIN PASSWORD
'XXXXXXXXXXXXXXXXXXX';
// GRANT ALL ON TABLE lsn2data TO replication_lag_user;
//
// CREATE OR REPLACE FUNCTION get_replication_lag() RETURNS TABLE
(subscriber name, lag bigint) AS
// $BODY$
// DECLARE
// subscriber name;
// BEGIN
// FOR subscriber IN
// SELECT slot_name FROM pg_replication_slots
// LOOP
// RETURN QUERY SELECT slot_name, EXTRACT(EPOCH FROM
NOW()-seen)::bigint lag from lsn2data,pg_replication_slots WHERE
slot_name=subscriber AND lsn < confirmed_flush_lsn ORDER BY lsn DESC
LIMIT 1;
// END LOOP;
// RETURN;
// END
// $BODY$
// LANGUAGE plpgsql;
//
while (1) {
$dbq = pg_query("INSERT INTO lsn2data (lsn) VALUES
(pg_current_wal_lsn())");
if ($dbq === FALSE) {
mylog(LOG_ERROR, "SQL query error:
".pg_last_error()."\n");
exit(1);
}

$dbq = pg_query("DELETE FROM lsn2data WHERE lsn < (".
"SELECT lsn FROM lsn2data WHERE lsn < (".
"SELECT confirmed_flush_lsn FROM
pg_replication_slots ORDER BY confirmed_flush_lsn ASC LIMIT 1".
") ORDER BY lsn DESC LIMIT 1".
")"
);
if ($dbq === FALSE) {
mylog(LOG_ERROR, "SQL query error:
".pg_last_error()."\n");
exit(1);
}
sleep($accuracy);
}

2. I graph the replications lags (converted from LSN to seconds) in my
check_mk monitoring:

#!/bin/bash

#
# Managed by Puppet:
modules/base/files/monitoring/check_logical_replication_lag.sh
#
# Check the logical replication lag and export performance data for each
subscriber
#

# exit on error
#set -e

#Make sure this script only runs one at a time
(

ME=$0
MEBASE=`basename $0`

mylog () {
echo "$MEBASE: $1"
logger -t "$MEBASE" "$1"
}

flock -x -w 1 200
if [ $? != "0" ]; then
#echo "ERROR: $0 is already running ... exit"
logger -t "$MEBASE" "ERROR: $0 is already running ... exit"
exit 1
fi

# Do stuff

# Variablen fuer Monitoring
CMK_SPOOLDIR=/var/lib/check_mk_agent/spool
CMK_NAME=$MEBASE
CMK_SPOOLFILE=600_`basename ${CMK_NAME}`.txt
CMK_HEADER="<<<local>>>"
TMP_FILE="/tmp/logical_replication_lag.csv"

# Schwellwerte
warn=300
crit=600

final_output="$CMK_HEADER\nP $CMK_NAME "

# move to a directory where user postgresl may reside (sudo)
cd /tmp

# Lag auslesen. Waehrend dem initialen aufsynchen eines Subscribers gibt
es temporaere Subscriptions, mit dem Namen reg_xxx1-pid-sync-pid.
# Damit diese nicht getrackt werden gibt es die huebsche LIKE Clause.
rm -f "$TMP_FILE"
sudo -u postgres psql regdns -c "COPY (SELECT subscriber,lag FROM
get_replication_lag() WHERE subscriber LIKE '%\_____' ORDER BY 2 DESC)
TO '$TMP_FILE' With CSV" 2>&1> /dev/null
LC=$(sudo -u postgres psql -t regdns -c "SELECT count(*) FROM
get_replication_lag();" | tr -d ' ')

if [ $LC == "0" ]; then
echo -e "$CMK_HEADER\n0 $CMK_NAME - No Slaves with Replication
found - maybe we are a slave?" > $CMK_SPOOLDIR/$CMK_SPOOLFILE
exit 0;
fi

grep $(hostname | cut -d '-' -f2) "$TMP_FILE" > /dev/null
if [ $? != "0" ]; then
echo -e "$CMK_HEADER\n2 $CMK_NAME - Postgres Output does not
seem valid. Please check script $ME and output in $TMP_FILE" >
$CMK_SPOOLDIR/$CMK_SPOOLFILE
exit 1;
fi

# CSV in Array einlesen
IFS=$'\n' read -d '' -r -a input_file < "$TMP_FILE"

# Auswerten
maxlag=0
for i in "${input_file[(at)]}"; do
node=`echo $i | awk -F "," '{print $1}' | tr -- _ -`
lag=`echo $i | awk -F "," '{print $2}'`
final_output="$final_output$node=$lag;$warn;$crit|"
#
https://unix.stackexchange.com/questions/186663/is-there-a-unix-command-that-gives-the-minimum-maximum-of-two-numbers
maxlag=$(( maxlag > lag ? maxlag : lag ))
done
final_output="${final_output}max-lag=$maxlag;$warn;$crit"

# Letztes Pipe Zeichen rausschneiden
#final_output=`echo $final_output | rev | cut -c 2- | rev`

# Spool File schreiben
echo -e $final_output > $CMK_SPOOLDIR/$CMK_SPOOLFILE
logger -t "$MEBASE" "$final_output"

) 200>/tmp/`basename $0`.exclusivelock

3. During initial sync I check the status on the subscriber. Once it has
synced all tables of the publication, it will send me an email.
#
# Managed by Puppet:
modules/pdns/templates/check_pglogical_subscription.sh.erb
#

#
# This script checks and eventually creates the subscription, and wait
until the initial sync is finished
#

PUB=regdns2020_pub
SLEEP=5
PREFIX=check_pglogical_subscription.sh
NUMTAB=175

SECONDS=0
date
while true; do
echo "SELECT * from pg_subscription;" | sudo -u postgres psql -t
regdns | grep -q $PUB
if [ $? -eq 0 ]; then
echo "OK: Host is subscribed to '$PUB'. Checking for
table count ..."
break
fi
echo "ERROR: Host is not subscribed to '$PUB'. Subscribing to
master ..."
logger -t $PREFIX "ERROR: Host is not subscribed to '$PUB'.
Subscribing to master ..."
echo "CREATE SUBSCRIPTION `hostname -s|tr -- - _` CONNECTION
'host=XXXXX dbname=XXXX user=XXXXX password=XXXXXX PUBLICATION
regdns2020_pub;" | sudo -u postgres psql regdns && touch
/etc/regdns.schema_subscription.created
echo "Re-Checking in $SLEEP seconds ..."
logger -t $PREFIX "Re-Checking in $SLEEP seconds ..."
sleep $SLEEP
done

while true; do
COUNT=$(echo "SELECT count(*) from pg_subscription_rel;" | sudo
-u postgres psql -t regdns | head -1 | xargs)
if [ $COUNT -eq $NUMTAB ]; then
echo "OK: Subscription '$PUB' contains $NUMTAB tables -
that is OK. Checking for initial-sync status ..."
logger -t $PREFIX "OK: Subscription '$PUB' contains
$NUMTAB tables - that is OK. Checking for initial-sync status ..."
break
fi
echo "ERROR: Subscription '$PUB' contains $COUNT tables, but
should contain $NUMTAB table. Re-Checking in $SLEEP seconds ..."
logger -t $PREFIX "ERROR: Subscription '$PUB' contains $COUNT
tables, but should contain $NUMTAB table. Re-Checking in $SLEEP seconds
..."
sleep $SLEEP
done

while true; do
COUNTFIN=$(echo "SELECT count(*) from pg_subscription_rel WHERE
srsubstate='r';" | sudo -u postgres psql -t regdns | head -1 | xargs)
if [ $COUNTFIN -eq $NUMTAB ]; then
echo "OK: Initial sync of $COUNTFIN/$NUMTAB tables
finished in $SECONDS seconds."
logger -t $PREFIX "OK: Initial sync of $COUNTFIN/$NUMTAB
tables finished in $SECONDS seconds."
echo "OK: Initial sync of $COUNTFIN/$NUMTAB tables
finished in $SECONDS seconds." | mailx -s "$HOST $SECONDS seconds to
subscribe" -- root
break
fi
echo "PROGRESS: Initial sync of $COUNTFIN/$NUMTAB tables
finished. Re-Checking in $SLEEP seconds ..."
logger -t $PREFIX "PROGRESS: Initial sync of $COUNTFIN/$NUMTAB
tables finished. Re-Checking in $SLEEP seconds ..."
sleep $SLEEP
done

regards
Klaus

Am 2023-10-07 17:31, schrieb Shaheed Haque:
> Hi,
>
> I've been playing with logical replication (currently on PG14),
> specifically in an AWS RDS Postgres context, but NOT using AWS' own
> replication tooling. I'm generally familiar with the challenges of
> distributed systems (such causality, time synchronisation etc), but
> not especially familiar with PG.
>
> In looking at how to tell how a given subscriber has caught up with
> its publisher, there is plenty of advice around the Web, for example
> https://dba.stackexchange.com/questions/314324/monitor-logical-replication-using-lsn.
> Like this example, much advice ends up talking about using separate
> queries on the publisher and the subscriber to compare LSNs. First, (I
> think) I understand the core difficulty that comparing LSNs is
> inherently racy, but given that, I'm a bit unclear as to why a single
> query on the publisher is not enough...IIUC:
>
> * Changes sent from the publisher to the subscriber are identified by
> LSN.
> * The publisher knows it's own current latest LSN
> (pg_current_wal_lsn()), but this seems not to be exposed at the
> subscriber.
> * The subscriber knows what it has applied locally and even tells the
> publisher (pg_stat_subscription.latest_end_lsn), but it does not seem
> to be exposed at the publisher.
>
> Have I missed something? Is there a way to track the LSN delta (given
> that this is known to be racy) just by querying one end?
>
> Second, how do folk "know" when replication is "done". For example, if
> the two LSNs continued to match for 1 * replication lag? Or N *
> replication lag? What would be a plausible N?
>
> Third, as we know when logical replication is started, the initial
> table state is captured in a snapshot, and sent across using COPY
> TABLE under the covers. Now, let's say that the publisher is idle
> (i.e. no SQL writes to the user's schema...obviously pg_catalog might
> change as replication is configured and enabled) and that the
> replication starts with the publisher as LSN_start. How could one know
> when the copying is done:
>
> * I initially assumed that the publisher's LSN would not change from
> LSN_start, but as the copying proceeds, I see that it DOES change
> (presumably because there are updates happening to pg_catalog, such as
> the temporary slots coming and going).
> * Is there some kind of singleton state on either publisher or
> subscriber that could be checked to know? (At the moment, I am
> counting the records in all copied tables).
>
> I realise that the knowledge that the publisher is "idle" is a special
> case, but right now, my test for being "done" is:
>
> * Number of records in copied tables matches AND the publisher's
> pg_stat_subscription matches the subscriber's
> pg_stat_subscription.latest_end_lsn.
>
> Plus or minus the bit about replication lag, is there a better way?
>
> Thanks, Shaheed

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Bruce Momjian 2024-01-30 03:56:56 Re: Best practices for data buffer cache setting/tuning (v15)
Previous Message Adrian Klaver 2024-01-29 23:02:51 Re: Scriptable way to validate a pg_dump restore ?