Re: Logical replication monitoring

From: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
To: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: Logical replication monitoring
Date: 2018-11-23 14:56:55
Message-ID: 331d46b8-8e04-6ed6-d86a-fbc6ff1a9af1@matrix.gatewaynet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

On 23/11/18 4:31 μ.μ., AYahorau(at)ibagroup(dot)eu wrote:
> Hello,
> Thank you again for the suggestion. I configured tail_n_mail on my SLES12 machine as follows as everything was ok:
>
> LOG_LINE_PREFIX: '%t %e '
> EMAIL: someone(at)example(dot)com
> MAILSUBJECT: Acme HOST Postgres errors UNIQUE : NUMBER
> INCLUDE: ERROR:  terminating logical replication worker due to timeout
> INCLUDE: LOG:  worker process: logical replication worker for subscription [0-9]+ \({1}PID [0-9]+\){1} exited with exit code [0-9]+
> FILE: /var/lib/pgsql/pg_log/LATEST
>
> So I configured *LOG_LINE_PREFIX*  equal to *log_prefix_line *from postgresql.conf.
> So tail_n_mail was able to catch such entries as this one:
> 2018-11-21 12:41:32 *FET *00000 LOG:  worker process: logical replication worker for subscription 16386 (PID 31777) exited with exit code 1
>
> But after I have updated my SLES12 to SP3 tail_n_mail stopped noticing these entry.
> 2018-11-21 12:41:32 *+03 *00000 LOG:  worker process: logical replication worker for subscription 16386 (PID 31777) exited with exit code 1
> The reason is that  in different output of  timezone: *FET *was before and *+03* is now.

Maybe (due to the upgrade) you are missing some package having to do with timezone data?

>
>
> I have two questions in this regard.
> I understand that currently tail_n_mail is not completely perfect detector. Is there any other robust, reliable way of logical replication monitoring?

tail_n_mail is the only one I found. In my case it doesn't support : %c (session id), I filed an issue.
Or, you could easily hack the source to fix this regexp from \\w\\w\\w\\w? into smth that matches +03 (e.g. \\+\\d\\d)  and make it work.

> Is it possible to configure any particular timestamp format  in postgresql  for its logs? As far as I know it can be only turned on or off depending on*%t* in *log_prefix_line.
> *
> Thank you in advance,
> Andrei Yahorau
>
>
>
> From: Andrei Yahorau/IBA
> To: pgsql-admin(at)postgresql(dot)org,
> Cc: Mikalai Keida/IBA(at)IBA
> Date: 24/08/2018 11:49
> Subject: Re: Logical replication monitoring
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>
> Hello,
> Thank you for the suggestion.
> I increased wal_receiver_timeout , wal_sender_timeout parameters and now this error does not occur.
>
> I installed tail_n_mail utility, made a simple config started in debug mode.
> I am constantly facing the same error:
>
> WARNING! Skipping non-existent file "/var/lib/pgsql/pg_log/postgresql.log-2018-08-23_154034"
> Too many loops (20161): bailing:
>
> The configuration file tail_n_mail.conf is quiet standart:
> EMAIL: someone(at)example(dot)com
> PGLOG: log
> MAILSUBJECT: Acme HOST Postgres errors UNIQUE : NUMBER
> INCLUDE: ERROR:
> INCLUDE: FATAL:
> INCLUDE: PANIC:
> FILE1: /var/lib/pgsql/pg_log/postgresql.log-%Y-%m-%d_%H%M%S
> LASTFILE1: /var/lib/pgsql/pg_log/postgresql.log-2018-08-23_154034
>
> Could you please say is there anything wrong in my configuration or script usage?
>
> Thank you,
> Andrei Yahorau
>
>
>
>
> From: Andrei Yahorau/IBA
> To: pgsql-admin(at)postgresql(dot)org,
> Cc: Mikalai Keida/IBA(at)IBA
> Date: 13/08/2018 13:16
> Subject: Re: Logical replication monitoring
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>
> Hello!
>
> Thank you for your suggestion.
> I  afraid this approach is not suitable for me. As a rule my postgresql log  on subscriber side contains a  bunch of the following entries:
>
> *ERROR:  terminating logical replication worker due to timeout*
> *00000 LOG:  worker process: logical replication worker for subscription 24578 (PID 6217) exited with exit code 1*
> *
> *How should I handle this situation? *
> *As I understand this is quite normal situation. But why is severity for it  an ERROR ?
>
> I have another assumption. Could you correct me if I am wrong.
> I found out in the source code that logical replication worker termination depends on *wal_receiver_timeout *paramer.
> So I propose setting *wal_receiver_timeout *to 0.
> In this case I think that monitoring of the following views *pg_stat_subscription, pg_publication *and *pg_stat_replication * is enough.
> In case if there  is some problem with connection or with replication *pg_stat_replication * will show nothing because wal sender will not be working otherwise it will give some information.
> Am I right? Are there any vulnerabilities in this approach ?
>
> Best regards,
> Andrei Yahorau
>
>
>
>
> From: Andrei Yahorau/IBA
> To: pgsql-admin(at)postgresql(dot)org,
> Cc: Mikalai Keida/IBA(at)IBA
> Date: 10/08/2018 13:05
> Subject: Logical replication monitoring
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>
>
> Hello PostgreSQL Community!
>
> I configured logical replication for PostgreSQL 10.4 on 2 machines, set wal_level to logical, created a publication on master node and created a subscription on standby node according to the
> PostgreSQL documentation.
> Could you please suggest an approach for replication state monitoring.
>
> According to my experience the monitoring of *pg_stat_subscription *and *pg_publication, pg_replication_slots *unfortunately is not enough for this aim. Moreover standby database does not prohibit
> write operations by default and it can lead to some inconsistency between these databases.
>
> For example a chain of queries as
> *SELECT pg_is_is_recovery()*,
> *SELECT * FROM pg_stat_replication* and
> *SELECT * FROM pg_stat_wal_receiver*
> provide insight into replication state for hot_standby replication.
>
> So is there a reliable way of replication state monitoring for logical replication?
>
> Best regards,
> Andrei Yahorau

--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt

In response to

Browse pgsql-admin by date

  From Date Subject
Next Message Stephen Frost 2018-11-23 15:17:25 Re: User Authentication: LDAP and "local" accounts concurrently ?
Previous Message Tom Lane 2018-11-23 14:47:36 Re: User Authentication: LDAP and "local" accounts concurrently ?