From: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Logical replication hangs up.
Date: 2018-11-05 09:23:23
Message-ID: 02fc0ca8-3b46-3c5d-9cf6-46d75fed7ec5@matrix.gatewaynet.com
Lists: pgsql-general
On 3/11/18 6:02 p.m., Jerry Sievers wrote:
> Aleš Zelený <zeleny(dot)ales(at)gmail(dot)com> writes:
>
>> Hello,
>>
>> we are using logical replication on 10.4 and it now hangs. After
>> some timeout it is restarted again, replays 18GB of data and then
>> hangs again (while 7GB of WAL remains to be processed).
> Timeout...
>
> Have a look at the two settings wal_sender_timeout and
> wal_receiver_timeout; you probably need to raise the sender timeout value.
That's been the case for me as well; setting both timeouts to 5 minutes solved my issues.
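For reference, this is roughly what that change looks like - only a sketch, and the 5-minute value is simply what worked here, not a recommendation for every workload:

-- on the publisher
ALTER SYSTEM SET wal_sender_timeout = '5min';
-- on the subscriber
ALTER SYSTEM SET wal_receiver_timeout = '5min';
-- both settings are picked up on reload, no restart needed
SELECT pg_reload_conf();

Note that wal_sender_timeout is checked on the publisher and wal_receiver_timeout on the subscriber, so each side needs its own change.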
>
> HTH
>
>> The backlog of 18GB comes from a failed migration that added a new
>> table to the publication while the replication user had not been
>> granted SELECT on that table. This was clear from the log files, and
>> once it was resolved by adding the SELECT privilege I expected
>> everything to work as usual (the same thing happened in the test
>> environment many times, and adding the missing SELECT grant was
>> enough to get it working... those were issues on test).
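Just to spell the fix out for the archives - the grant itself is a one-liner. The table and schema names below are placeholders; the role name repusr is taken from the logs further down:

GRANT SELECT ON TABLE public.the_new_table TO repusr;
-- or, more broadly, so future additions to the publication do not hit the same wall:
GRANT SELECT ON ALL TABLES IN SCHEMA public TO repusr;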
>>
>> RDBMS Version:
>> PostgreSQL 10.4 (Ubuntu 10.4-2.pgdg16.04+1) on x86_64-pc-linux-gnu,
>> compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609,
>> 64-bit
>>
>> Publication (master) instance error message from log:
>>
>> 2018-11-02 18:34:14 UTC 7974 5bdc8d27.1f26 7 192.168.23.11(58884)
>> master_prod repusr 0 5/0 sub_eur idle [00000]:LOG: terminating
>> walsender process due to replication timeout
>>
>> Subscription instance log:
>>
>> 2018-11-02 18:34:14 UTC 8657 5bdc8d26.21d1 2 0 3/0
>> [XX000]:ERROR: could not receive data from WAL stream: SSL
>> connection has been closed unexpectedly
>> 2018-11-02 18:34:14 UTC 24699 5b923b1c.607b 1209397 0
>> [00000]:LOG: worker process: logical replication worker for
>> subscription 37932 (PID 8657) exited with exit code 1
>>
>> While it looks like a network issue, it may well not be - we have
>> checked the network and the monitoring, and packets were being
>> exchanged the whole time.
>>
>> We have 3 subscriptions (and thus 3 replication slots) for one
>> publication; one subscriber instance is within the same datacenter as
>> the master, the remaining subscribers are remote.
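Side note: on each subscriber the apply worker can be watched directly as well; pg_stat_subscription exists in 10, so something along these lines should show whether it is still receiving and confirming anything:

SELECT subname, pid, received_lsn, latest_end_lsn, last_msg_receipt_time
FROM pg_stat_subscription;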
>>
>>
>>>> select * from pg_replication_slots
>> +------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
>> | slot_name | plugin | slot_type | datoid | database |
>> temporary | active | active_pid | xmin | catalog_xmin | restart_lsn
>> | confirmed_flush_lsn |
>> +------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
>> | sub_usd | pgoutput | logical | 16421 | master_prod |
>> False | True | 16604 | | 5536488 | 426/AAE55A68
>> | 426/AAE55A68 |
>> | sub_cad | pgoutput | logical | 16421 | master_prod |
>> False | True | 22875 | | 5536488 | 426/AAE55A68
>> | 426/AAE55A68 |
>> | sub_eur | pgoutput | logical | 16421 | master_prod |
>> False | True | 16605 | | 5536488 | 426/AAE55A68
>> | 426/AAE55A68 |
>> +------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
>>
>>
>> After the error occurs, the WAL senders are restarted and they read
>> WAL files until they reach the WAL file containing restart_lsn:
>>
>>>> select pg_walfile_name('426/AAE55A68')
>> +--------------------------+
>> | pg_walfile_name |
>> +--------------------------+
>> | 0000000100000426000000AA |
>> +--------------------------+
>>
>> # changing file names until they reach this one:
>> root(at)master-db:/pgsql/pgcluster/10/master_prod# lsof -p 1560 -p 5758 -p 5790 | grep pg_wal
>> postgres 1560 postgres 10r REG 259,3 16777216 115766007 /pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
>> postgres 5758 postgres 10r REG 259,3 16777216 115766007 /pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
>> postgres 5790 postgres 10r REG 259,3 16777216 115766007 /pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
>>
>>
>> At this moment sent_lsn stops growing and nothing is happening for a
>> while.
>>
>> select * from pg_stat_replication;
>>
>> -[ RECORD 1 ]----+------------------------------
>> pid              | 16604
>> usesysid         | 37868
>> usename          | repusr
>> application_name | sub_usd
>> client_addr      | 192.168.21.11
>> client_hostname  |
>> client_port      | 35010
>> backend_start    | 2018-11-02 23:52:22.059157+00
>> backend_xmin     |
>> state            | catchup
>> sent_lsn         | 429/69E9CC60
>> write_lsn        | 426/AAE55A68
>> flush_lsn        | 426/AAE55A68
>> replay_lsn       | 426/AAE55A68
>> write_lag        |
>> flush_lag        |
>> replay_lag       |
>> sync_priority    | 0
>> sync_state       | async
>> -[ RECORD 2 ]----+------------------------------
>> pid              | 16605
>> usesysid         | 37868
>> usename          | repusr
>> application_name | sub_eur
>> client_addr      | 192.168.23.11
>> client_hostname  |
>> client_port      | 36388
>> backend_start    | 2018-11-02 23:52:24.75038+00
>> backend_xmin     |
>> state            | catchup
>> sent_lsn         | 429/69E9CC60
>> write_lsn        | 426/AAE55A68
>> flush_lsn        | 426/AAE55A68
>> replay_lsn       | 426/AAE55A68
>> write_lag        |
>> flush_lag        |
>> replay_lag       |
>> sync_priority    | 0
>> sync_state       | async
>> -[ RECORD 3 ]----+------------------------------
>> pid              | 12387
>> usesysid         | 37868
>> usename          | repusr
>> application_name | sub_cad
>> client_addr      | 192.168.22.11
>> client_hostname  |
>> client_port      | 51982
>> backend_start    | 2018-11-02 23:30:31.372404+00
>> backend_xmin     |
>> state            | catchup
>> sent_lsn         | 429/69E9CC60
>> write_lsn        | 426/AAE55A68
>> flush_lsn        | 426/AAE55A68
>> replay_lsn       | 426/AAE55A68
>> write_lag        |
>> flush_lag        |
>> replay_lag       |
>> sync_priority    | 0
>> sync_state       | async
>> (3 rows)
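If it helps, the catchup distance can also be expressed in bytes on the publisher; just a sketch:

SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sent_behind_bytes,
       pg_wal_lsn_diff(sent_lsn, replay_lsn)           AS unconfirmed_bytes
FROM pg_stat_replication;

With the numbers above, the gap between sent_lsn (429/69E9CC60) and replay_lsn (426/AAE55A68) works out to roughly 11GB of WAL the subscribers have not yet confirmed as applied.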
>>
>> What I found going on behind the scenes while, according to
>> pg_stat_replication, "nothing is happening":
>>
>> On the master, strace of the WAL senders looks like this:
>>
>> open("pg_replslot/sub_usd/xid-6040508-lsn-429-57000000.snap", O_WRONLY|O_CREAT|O_APPEND, 0600) = 15
>> write(15, "\334\0\0\0\0\0\0\0H\3\234W)\4\0\0\0\0\0\0\0\0\0\0\177\6\0\0%@\0\0"..., 220) = 220
>> close(15) = 0
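So the sender is not actually idle: those xid-*-lsn-*.snap files under pg_replslot/<slot>/ appear to be the snapshot/spill files logical decoding writes out while it works through the accumulated WAL. If you prefer watching that from SQL instead of the shell, something like this should do (superuser; the path is relative to the data directory and the slot name is taken from above):

SELECT count(*) AS snap_files
FROM pg_ls_dir('pg_replslot/sub_usd') AS f
WHERE f LIKE '%.snap';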
>>
>> On the subscription instance, strace for the WAL receiver looks like this:
>> recvfrom(7, 0x55f56a3783c3, 5, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
>> epoll_create1(EPOLL_CLOEXEC) = 10
>> epoll_ctl(10, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1781282352, u64=94512536630832}}) = 0
>> epoll_ctl(10, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1781282376, u64=94512536630856}}) = 0
>> epoll_ctl(10, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1781282400, u64=94512536630880}}) = 0
>> epoll_wait(10, [], 1, 1000) = 0
>>
>> tcpdump:
>> 21:33:16.811590 IP (tos 0x0, ttl 64, id 24478, offset 0, flags [DF],
>> proto TCP (6), length 52)
>> 192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xe2ab
>> (correct), ack 1360, win 235, options [nop,nop,TS val 3699603 ecr
>> 2060668460], length 0
>> 21:33:26.821860 IP (tos 0x0, ttl 64, id 64522, offset 0, flags [DF],
>> proto TCP (6), length 120)
>> 192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum
>> 0x3ecf (incorrect -> 0xbaf0), seq 1360:1428, ack 1, win 265, options
>> [nop,nop,TS val 2060670963 ecr 3699603], length 68
>> 21:33:26.822156 IP (tos 0x0, ttl 64, id 24479, offset 0, flags [DF],
>> proto TCP (6), length 52)
>> 192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xceda
>> (correct), ack 1428, win 235, options [nop,nop,TS val 3702105 ecr
>> 2060670963], length 0
>> 21:33:36.832445 IP (tos 0x0, ttl 64, id 64523, offset 0, flags [DF],
>> proto TCP (6), length 120)
>> 192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum
>> 0x3ecf (incorrect -> 0x3881), seq 1428:1496, ack 1, win 265, options
>> [nop,nop,TS val 2060673465 ecr 3702105], length 68
>> 21:33:36.832752 IP (tos 0x0, ttl 64, id 24480, offset 0, flags [DF],
>> proto TCP (6), length 52)
>> 192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xbb09
>> (correct), ack 1496, win 235, options [nop,nop,TS val 3704608 ecr
>> 2060673465], length 0
>> 21:33:46.843063 IP (tos 0x0, ttl 64, id 64524, offset 0, flags [DF],
>> proto TCP (6), length 120)
>> 192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum
>> 0x3ecf (incorrect -> 0xcf34), seq 1496:1564, ack 1, win 265, options
>> [nop,nop,TS val 2060675968 ecr 3704608], length 68
>> 21:33:46.843362 IP (tos 0x0, ttl 64, id 24481, offset 0, flags [DF],
>> proto TCP (6), length 52)
>> 192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xa737
>> (correct), ack 1564, win 235, options [nop,nop,TS val 3707111 ecr
>> 2060675968], length 0
>>
>> Some checksum issues, but the segments are finally ACKed (the
>> "incorrect" checksums on outgoing packets are most likely just TX
>> checksum offloading on the capturing host, not a real problem). I
>> tried opening a psql session from the host running the subscription
>> instance to the master instance and it worked like a charm.
>>
>>
>> I've tried pg_waldump, but found nothing that I could consider an
>> error...
>>>> select pg_walfile_name('426/AAE55A68')
>> +--------------------------+
>> | pg_walfile_name |
>> +--------------------------+
>> | 0000000100000426000000AA |
>> +--------------------------+
>> /usr/lib/postgresql/10/bin/pg_waldump -s 426/AAE55A68
>> 0000000100000426000000AA | less
>>
>> rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn:
>> 426/AAE55A68, prev 426/AAE559C0, desc: RUNNING_XACTS nextXid 5536488
>> latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn:
>> 426/AAE55AA0, prev 426/AAE55A68, desc: RUNNING_XACTS nextXid 5536488
>> latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn:
>> 426/AAE55AD8, prev 426/AAE55AA0, desc: RUNNING_XACTS nextXid 5536488
>> latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn:
>> 426/AAE55B10, prev 426/AAE55AD8, desc: RUNNING_XACTS nextXid 5536488
>> latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn:
>> 426/AAE55B48, prev 426/AAE55B10, desc: RUNNING_XACTS nextXid 5536488
>> latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn:
>> 426/AAE55B80, prev 426/AAE55B48, desc: RUNNING_XACTS nextXid 5536488
>> latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: XLOG len (rec/tot): 24/ 24, tx: 0, lsn:
>> 426/AAE55BB8, prev 426/AAE55B80, desc: SWITCH
>>
>>
>> I've also done the same for sent_lsn:
>>>> select pg_walfile_name('429/69E9CC60');
>> +--------------------------+
>> | pg_walfile_name |
>> +--------------------------+
>> | 000000010000042900000069 |
>> +--------------------------+
>>
>> /usr/lib/postgresql/10/bin/pg_waldump -s 429/69E9CC60
>> 000000010000042900000069 | less
>> rmgr: Heap len (rec/tot): 76/ 76, tx: 5536495, lsn:
>> 429/69E9CC60, prev 429/69E9CC10, desc: DELETE off 81 KEYS_UPDATED ,
>> blkref #0: rel 1663/16421/38572 blk 40
>> rmgr: Heap len (rec/tot): 76/ 76, tx: 5536495, lsn:
>> 429/69E9CCB0, prev 429/69E9CC60, desc: DELETE off 82 KEYS_UPDATED ,
>> blkref #0: rel 1663/16421/38572 blk 40
>> rmgr: Heap len (rec/tot): 76/ 76, tx: 5536495, lsn:
>> 429/69E9CD00, prev 429/69E9CCB0, desc: DELETE off 83 KEYS_UPDATED ,
>> blkref #0: rel 1663/16421/38572 blk 40
>>
>> The relation 38572 from the list above is an ordinary,
>> non-replicated table.
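Side note for the archives: the number in blkref #0 is the relfilenode, which usually - but not always - equals the table's OID, so if in doubt it can be resolved explicitly, e.g.:

SELECT oid::regclass AS table_name
FROM pg_class
WHERE relfilenode = 38572;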
>>
>> Any help or advice on how to diagnose this and get it working again
>> is highly appreciated.
>>
>> Kind regards Ales Zeleny
>>
>>
--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt