Re: Logical replication monitoring

From: AYahorau(at)ibagroup(dot)eu
To: pgsql-admin(at)postgresql(dot)org
Cc: MikalaiKeida(at)ibagroup(dot)eu
Subject: Re: Logical replication monitoring
Date: 2018-11-23 14:31:18
Message-ID: OF4FA86A2F.76D55182-ON4325834E.004D8E2E-4325834E.004FC51F@iba.by
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hello,
Thank you again for the suggestion. I configured tail_n_mail on my SLES12
machine as follows as everything was ok:

LOG_LINE_PREFIX: '%t %e '
EMAIL: someone(at)example(dot)com
MAILSUBJECT: Acme HOST Postgres errors UNIQUE : NUMBER
INCLUDE: ERROR: terminating logical replication worker due to timeout
INCLUDE: LOG: worker process: logical replication worker for subscription
[0-9]+ \({1}PID [0-9]+\){1} exited with exit code [0-9]+
FILE: /var/lib/pgsql/pg_log/LATEST

So I configured LOG_LINE_PREFIX equal to log_prefix_line from
postgresql.conf.
So tail_n_mail was able to catch such entries as this one:
2018-11-21 12:41:32 FET 00000 LOG: worker process: logical replication
worker for subscription 16386 (PID 31777) exited with exit code 1

But after I have updated my SLES12 to SP3 tail_n_mail stopped noticing
these entry.
2018-11-21 12:41:32 +03 00000 LOG: worker process: logical replication
worker for subscription 16386 (PID 31777) exited with exit code 1
The reason is that in different output of timezone: FET was before and
+03 is now.

I have two questions in this regard.
I understand that currently tail_n_mail is not completely perfect
detector. Is there any other robust, reliable way of logical replication
monitoring?
Is it possible to configure any particular timestamp format in postgresql
for its logs? As far as I know it can be only turned on or off depending
on %t in log_prefix_line.

Thank you in advance,
Andrei Yahorau

From: Andrei Yahorau/IBA
To: pgsql-admin(at)postgresql(dot)org,
Cc: Mikalai Keida/IBA(at)IBA
Date: 24/08/2018 11:49
Subject: Re: Logical replication monitoring

Hello,
Thank you for the suggestion.
I increased wal_receiver_timeout , wal_sender_timeout parameters and now
this error does not occur.

I installed tail_n_mail utility, made a simple config started in debug
mode.
I am constantly facing the same error:

WARNING! Skipping non-existent file
"/var/lib/pgsql/pg_log/postgresql.log-2018-08-23_154034"
Too many loops (20161): bailing:

The configuration file tail_n_mail.conf is quiet standart:
EMAIL: someone(at)example(dot)com
PGLOG: log
MAILSUBJECT: Acme HOST Postgres errors UNIQUE : NUMBER
INCLUDE: ERROR:
INCLUDE: FATAL:
INCLUDE: PANIC:
FILE1: /var/lib/pgsql/pg_log/postgresql.log-%Y-%m-%d_%H%M%S
LASTFILE1: /var/lib/pgsql/pg_log/postgresql.log-2018-08-23_154034

Could you please say is there anything wrong in my configuration or script
usage?

Thank you,
Andrei Yahorau

From: Andrei Yahorau/IBA
To: pgsql-admin(at)postgresql(dot)org,
Cc: Mikalai Keida/IBA(at)IBA
Date: 13/08/2018 13:16
Subject: Re: Logical replication monitoring

Hello!

Thank you for your suggestion.
I afraid this approach is not suitable for me. As a rule my postgresql
log on subscriber side contains a bunch of the following entries:

ERROR: terminating logical replication worker due to timeout
00000 LOG: worker process: logical replication worker for subscription
24578 (PID 6217) exited with exit code 1

How should I handle this situation?
As I understand this is quite normal situation. But why is severity for it
an ERROR ?

I have another assumption. Could you correct me if I am wrong.
I found out in the source code that logical replication worker termination
depends on wal_receiver_timeout paramer.
So I propose setting wal_receiver_timeout to 0.
In this case I think that monitoring of the following views
pg_stat_subscription, pg_publication and pg_stat_replication is enough.
In case if there is some problem with connection or with replication
pg_stat_replication will show nothing because wal sender will not be
working otherwise it will give some information.
Am I right? Are there any vulnerabilities in this approach ?

Best regards,
Andrei Yahorau

From: Andrei Yahorau/IBA
To: pgsql-admin(at)postgresql(dot)org,
Cc: Mikalai Keida/IBA(at)IBA
Date: 10/08/2018 13:05
Subject: Logical replication monitoring

Hello PostgreSQL Community!

I configured logical replication for PostgreSQL 10.4 on 2 machines, set
wal_level to logical, created a publication on master node and created a
subscription on standby node according to the PostgreSQL documentation.
Could you please suggest an approach for replication state monitoring.

According to my experience the monitoring of pg_stat_subscription and
pg_publication, pg_replication_slots unfortunately is not enough for this
aim. Moreover standby database does not prohibit write operations by
default and it can lead to some inconsistency between these databases.

For example a chain of queries as
SELECT pg_is_is_recovery(),
SELECT * FROM pg_stat_replication and
SELECT * FROM pg_stat_wal_receiver
provide insight into replication state for hot_standby replication.

So is there a reliable way of replication state monitoring for logical
replication?

Best regards,
Andrei Yahorau

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Lentes, Bernd 2018-11-23 14:36:42 User Authentication: LDAP and "local" accounts concurrently ?
Previous Message Achilleas Mantzios 2018-11-23 12:13:26 Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"