Replication stops under certain circumstances

From: Kim Rose Carlsen <krc(at)hiper(dot)dk>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Replication stops under certain circumstances
Date: 2017-10-20 17:39:26
Message-ID: VI1PR05MB170908E04964213E8C2F3837C7430@VI1PR05MB1709.eurprd05.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi

I have some strange issues with a postgresql read replica that seems to stop replicating under certain circumstances.

Whenever we have changes to our views we have script that drops all views and reload them from scratch with the new definitions. The reloading of the views happens in a transaction to avoid confusion for everyone using the database. When this update gets to the slave it seems there is a chance for a deadlock to occur that doesn't get detected.

As I was trying to reproduce this behavior, I ran into another weird situation that I don't entirely understand. The symptom is the same that replication stops, but it looks quite different. This example won't reproduce the issue reliably, but after a few hours I get a slave that won't continue to replicate until I restart it. The queries in the example won't make much sense, and I don't know if they can be simplified further and still cause the "desired" effect.

Setup:
Launch a new RDS psql instance (9.6.2) on AWS (will be referred to as db-master) and create a read replica (will be referred to as db-slave). The following options are changed from AWS default:
 
max_standby_streaming_delay=-1
hot_standby_feedback=1

On the master create 2 dummy tables:
create table a  (id serial primary key);
create table b  (id serial primary key);

Setup thread 1 to do work on master:

while true; do psql -h db-master -U postgres db -c 'begin; drop view if exists view_a cascade; drop view if exists view_b; drop view if exists view_c; create view view_a as select * from a; create view view_b as select * from b; create view view_c as select * from view_a join view_b using (id); insert into a values (default); insert into b values (default); commit;'; done

Setup thread 2 to do work on Slave:
while true; do psql -h  db-slave -U postgres db -c 'begin; select * from view_c order by random() limit 10; select * from view_a order by random() limit 10;'; done

Setup thread 3 to do more work on slave:
while true; do psql -h  db-slave -U postgres db -c 'begin; select * from view_b order by random() limit 10; select * from view_a order by random() limit 10;'; done

Every now and then a deadlock is detected and one connection is aborted, this works as expected. But After a while(serveral hours) it becomes impossible to connect to db on db-slave and thread 2 and 3 stops producing output. When trying to connect the psql client just hangs. However it is possible connect to template1 database to get a look on what is going on.

template1=> select * from pg_stat_activity;
-[ RECORD 1 ]----+--------------------------------
datid            | 16384
datname          | rdsadmin
pid              | 7891
usesysid         | 10
usename          | rdsadmin
application_name |
client_addr      |
client_hostname  |
client_port      |
backend_start    |
xact_start       |
query_start      |
state_change     |
wait_event_type  |
wait_event       |
state            |
backend_xid      |
backend_xmin     |
query            | <insufficient privilege>
-[ RECORD 2 ]----+--------------------------------
datid            | 1
datname          | template1
pid              | 11949
usesysid         | 16388
usename          | hiper
application_name | psql
client_addr      | 192.168.10.166
client_hostname  |
client_port      | 41002
backend_start    | 2017-10-20 16:30:26.032745+02
xact_start       | 2017-10-20 16:30:34.306418+02
query_start      | 2017-10-20 16:30:34.306418+02
state_change     | 2017-10-20 16:30:34.306421+02
wait_event_type  |
wait_event       |
state            | active
backend_xid      |
backend_xmin     | 26891
query            | select * from pg_stat_activity;

There are no active connection except rdsadmin from aws.

template1=> select * from pg_locks;
  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |        mode         | granted | fastpath
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+---------------------+---------+----------
 virtualxid |          |          |      |       | 3/929      |               |         |       |          | 3/929              |  9640 | ExclusiveLock       | t       | t
 relation   |    16390 |     2659 |      |       |            |               |         |       |          | 4/829              |  9639 | AccessShareLock     | t       | t
 relation   |    16390 |     1249 |      |       |            |               |         |       |          | 4/829              |  9639 | AccessShareLock     | t       | t
 virtualxid |          |          |      |       | 4/829      |               |         |       |          | 4/829              |  9639 | ExclusiveLock       | t       | t
 relation   |        1 |    11695 |      |       |            |               |         |       |          | 5/148              | 11949 | AccessShareLock     | t       | t
 virtualxid |          |          |      |       | 5/148      |               |         |       |          | 5/148              | 11949 | ExclusiveLock       | t       | t
 virtualxid |          |          |      |       | 1/1        |               |         |       |          | 1/0                |  7593 | ExclusiveLock       | t       | t
 object     |        0 |          |      |       |            |               |    1262 | 16390 |        0 | 4/829              |  9639 | RowExclusiveLock    | t       | f
 relation   |    16390 |     1259 |      |       |            |               |         |       |          | 4/829              |  9639 | AccessShareLock     | f       | f
 relation   |    16390 |     1259 |      |       |            |               |         |       |          | 1/0                |  7593 | AccessExclusiveLock | t       | f
 object     |        0 |          |      |       |            |               |    1262 | 16390 |        0 | 3/929              |  9640 | RowExclusiveLock    | t       | f
 relation   |    16390 |     1259 |      |       |            |               |         |       |          | 3/929              |  9640 | AccessShareLock     | f       | f
(12 rows)

Here there seems to be many threads that all wait for a lock on relation 1259, however I have no idea what the process that has the lock waits for (pid: 7593). I can't use pg_terminate_backend(7593) to free the lock, so my only option is to restart it.

I don't know if anyone can reproduce this effect?

why are the pids still holding locks even after the have been disconnected in pg_stat_activity?
I assume that one of the pids is the slave thread applying updates from the master?

This is not the exact problem Im trying to reproduce, in that example all connection end up in LOCKED state and there appear to be a deadlock when inspecting the pg_locks table that don't get discovered. I hope I can reproduce that example also at some time.

But have anyone else experienced problem with slave stopping to replicate because everything ends up being locked?

Cheers
Kim Carlsen

Browse pgsql-general by date

  From Date Subject
Next Message Igal @ Lucee.org 2017-10-20 18:22:06 Re: tgrm index for word_similarity
Previous Message Stephen Froehlich 2017-10-20 17:34:13 Issues shutting down PostgreSQL 10 cleanly