Re: Postgres 10, slave not catching up with master

From: Hellmuth Vargas <hivs77(at)gmail(dot)com>
To: boris(at)infosplet(dot)com
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Postgres 10, slave not catching up with master
Date: 2018-10-23 18:24:43
Message-ID: CAN3Qy4ptBTJtmLDE9pC2RT7Z1XJ_StSRkGDmR18DZanZsqzo5A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi

Both servers are configured with the same date, time and time configuration?

El mar., 23 de oct. de 2018 a la(s) 13:16, Hellmuth Vargas (hivs77(at)gmail(dot)com)
escribió:

> Hi
>
> which result you get from the following query:
>
> SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
> THEN 0
> ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
> END AS log_delay;
>
> source:
>
> https://severalnines.com/blog/postgresql-streaming-replication-deep-dive
>
> El mar., 23 de oct. de 2018 a la(s) 11:28, Boris Sagadin (
> boris(at)infosplet(dot)com) escribió:
>
>> Nothing special, just:
>>
>> standby_mode = 'on'
>> primary_conninfo = 'host=... user=repmgr application_name=nodex'
>> recovery_target_timeline = 'latest'
>>
>>
>> Boris
>>
>> On Tue, Oct 23, 2018 at 3:10 PM, Hellmuth Vargas <hivs77(at)gmail(dot)com>
>> wrote:
>>
>>> Hi
>>>
>>> can share recovery.conf file settings??
>>>
>>> El mar., 23 de oct. de 2018 a la(s) 00:28, Boris Sagadin (
>>> boris(at)infosplet(dot)com) escribió:
>>>
>>>> Yes, turning wal_compression off improves things. Slave that was
>>>> mentioned unfortunately lagged too much before this setting was applied and
>>>> was turned off. However the remaining slave lags less now, although still
>>>> occasionally up to a few minutes. I think single threadedness of recovery
>>>> is a big slowdown for write heavy databases. Maybe an option to increase
>>>> wal_size beyond 16MB in v11 will help.
>>>>
>>>> In the meantime we'll solve this by splitting the DB to 2 or 3 clusters
>>>> or maybe trying out some sharding solution like Citus.
>>>>
>>>>
>>>> Boris
>>>>
>>>> On Sun, Oct 21, 2018 at 9:06 AM, Boris Sagadin <boris(at)infosplet(dot)com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a database running on i3.8xlarge (256GB RAM, 32 CPU cores, 4x
>>>>> 1.9TB NVMe drive) AWS instance with about 5TB of disk space occupied, ext4,
>>>>> Ubuntu 16.04.
>>>>>
>>>>> Multi-tenant DB with about 40000 tables, insert heavy.
>>>>>
>>>>> I started a new slave with identical HW specs, SR. DB started syncing
>>>>> from master, which took about 4 hours, then it started applying the WALs.
>>>>> However, it seems it can't catch up. Delay is still around 3 hours
>>>>> (measured with now() - pg_last_xact_replay_timestamp()), even a day later.
>>>>> It goes a few 100s up and down, but it seems to float around 3h mark.
>>>>>
>>>>> Disk IO is low at about 10%, measured with iostat, no connected
>>>>> clients, recovery process is at around 90% CPU single core usage.
>>>>>
>>>>> Tried tuning the various parameters, but with no avail. Only thing I
>>>>> found suspicious is stracing the recovery process constantly produces many
>>>>> errors such as:
>>>>>
>>>>> lseek(428, 0, SEEK_END) = 780124160
>>>>> lseek(30, 0, SEEK_END) = 212992
>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource
>>>>> temporarily unavailable)
>>>>> lseek(680, 0, SEEK_END) = 493117440
>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource
>>>>> temporarily unavailable)
>>>>> lseek(774, 0, SEEK_END) = 583368704
>>>>>
>>>>> ...[snip]...
>>>>>
>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource
>>>>> temporarily unavailable)
>>>>> lseek(774, 0, SEEK_END) = 583368704
>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource
>>>>> temporarily unavailable)
>>>>> lseek(277, 0, SEEK_END) = 502882304
>>>>> lseek(6, 516096, SEEK_SET) = 516096
>>>>> read(6,
>>>>> "\227\320\5\0\1\0\0\0\0\340\7\246\26\274\0\0\315\0\0\0\0\0\0\0}\0178\5&/\260\r"...,
>>>>> 8192) = 8192
>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource
>>>>> temporarily unavailable)
>>>>> lseek(735, 0, SEEK_END) = 272809984
>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource
>>>>> temporarily unavailable)
>>>>> lseek(277, 0, SEEK_END) = 502882304
>>>>>
>>>>> ls -l fd/9
>>>>> lr-x------ 1 postgres postgres 64 Oct 21 06:21 fd/9 -> pipe:[46358]
>>>>>
>>>>>
>>>>> Perf top on recovery produces:
>>>>>
>>>>> 27.76% postgres [.] pglz_decompress
>>>>> 9.90% [kernel] [k] entry_SYSCALL_64_after_swapgs
>>>>> 7.09% postgres [.] hash_search_with_hash_value
>>>>> 4.26% libpthread-2.23.so [.] llseek
>>>>> 3.64% libpthread-2.23.so [.] __read_nocancel
>>>>> 2.80% [kernel] [k] __fget_light
>>>>> 2.67% postgres [.] 0x000000000034d3ba
>>>>> 1.85% [kernel] [k] ext4_llseek
>>>>> 1.84% postgres [.] pg_comp_crc32c_sse42
>>>>> 1.44% postgres [.] hash_any
>>>>> 1.35% postgres [.] 0x000000000036afad
>>>>> 1.29% postgres [.] MarkBufferDirty
>>>>> 1.21% postgres [.] XLogReadRecord
>>>>> [...]
>>>>>
>>>>> Tried changing the process limits with prlimit to unlimited, but no
>>>>> change.
>>>>>
>>>>> I can turn off the WAL compression but I doubt this is the main
>>>>> culprit. Any ideas appreciated.
>>>>>
>>>>> Regards,
>>>>> Boris
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Cordialmente,
>>>
>>> Ing. Hellmuth I. Vargas S.
>>> Esp. Telemática y Negocios por Internet
>>> Oracle Database 10g Administrator Certified Associate
>>> EnterpriseDB Certified PostgreSQL 9.3 Associate
>>>
>>>
>>
>
> --
> Cordialmente,
>
> Ing. Hellmuth I. Vargas S.
>
>
>

--
Cordialmente,

Ing. Hellmuth I. Vargas S.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Ravi Krishna 2018-10-23 19:58:20 Oracle vs PG
Previous Message Hellmuth Vargas 2018-10-23 18:16:50 Re: Postgres 10, slave not catching up with master