Re: Postgres 10, slave not catching up with master

From: Hellmuth Vargas <hivs77(at)gmail(dot)com>
To: boris(at)infosplet(dot)com
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Postgres 10, slave not catching up with master
Date: 2018-10-24 13:34:05
Message-ID: CAN3Qy4rE0vNo5Xudyt6Hg3mocYrxe85Zgi+WL2z69RdYSuvYBQ@mail.gmail.com
Lists: pgsql-general

Hi
On Wed, Oct 24, 2018 at 00:39 Boris Sagadin (boris(at)infosplet(dot)com)
wrote:

> Yes, times are all identical, set to UTC, ntpd is used.
>
> log_delay
> -----------
> 15.788175
>
> This is the delay at this moment, but we graph the replication delay and
> it's fluctuating between 0 and 30s.
>

But fortunately the fluctuation is between 0 and 30s, not 4 hours. Apart
from the wal_compression issue, I think you should also check the network.
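
If you want to see the lag from the master's side as well, the lag columns
added in version 10 can help. A minimal sketch, run on the master (assuming
the standby connects with application_name=nodex, as in your
primary_conninfo):

-- Per-standby lag as measured by the primary itself.
-- write_lag / flush_lag / replay_lag are new in PostgreSQL 10.
SELECT application_name,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;

If replay_lag grows while write_lag and flush_lag stay near zero, the WAL is
arriving fine and the standby is simply slow to apply it.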

> Before I turned off wal compression, the lag was much bigger (anywhere
> from 0 up to 8 minutes). We have lots of tables (40k) and many upserts.
>
>
> Boris
>
> On Tue, Oct 23, 2018 at 8:24 PM, Hellmuth Vargas <hivs77(at)gmail(dot)com> wrote:
>
>> Hi
>>
>> Are both servers configured with the same date, time and time zone
>> settings?
>>
>> On Tue, Oct 23, 2018 at 13:16 Hellmuth Vargas (hivs77(at)gmail(dot)com)
>> wrote:
>>
>>> Hi
>>>
>>> What result do you get from the following query, run on the standby?
>>>
>>> -- Replication delay in seconds, or 0 if the standby is fully caught up.
>>> SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
>>>             THEN 0
>>>             ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
>>>        END AS log_delay;
>>>
>>> The LSN comparison matters: pg_last_xact_replay_timestamp() only advances
>>> when transactions are replayed, so without it an idle primary would make
>>> the lag look larger than it is.
>>>
>>> source:
>>>
>>> https://severalnines.com/blog/postgresql-streaming-replication-deep-dive
>>>
>>> On Tue, Oct 23, 2018 at 11:28 Boris Sagadin (boris(at)infosplet(dot)com)
>>> wrote:
>>>
>>>> Nothing special, just:
>>>>
>>>> standby_mode = 'on'
>>>> primary_conninfo = 'host=... user=repmgr application_name=nodex'
>>>> recovery_target_timeline = 'latest'
>>>>
>>>>
>>>> Boris
>>>>
>>>> On Tue, Oct 23, 2018 at 3:10 PM, Hellmuth Vargas <hivs77(at)gmail(dot)com>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Can you share your recovery.conf settings?
>>>>>
>>>>> On Tue, Oct 23, 2018 at 00:28 Boris Sagadin (boris(at)infosplet(dot)com)
>>>>> wrote:
>>>>>
>>>>>> Yes, turning wal_compression off improves things. The slave mentioned
>>>>>> earlier unfortunately lagged too much before this setting was applied,
>>>>>> and was turned off. However, the remaining slave lags less now,
>>>>>> although still occasionally up to a few minutes. I think the
>>>>>> single-threadedness of recovery is a big slowdown for write-heavy
>>>>>> databases. Maybe the option in v11 to increase the WAL segment size
>>>>>> beyond 16MB will help.
>>>>>>
>>>>>> In the meantime we'll solve this by splitting the DB into 2 or 3
>>>>>> clusters, or maybe by trying out a sharding solution like Citus.
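
For reference: the 16MB segment size is baked in at initdb time, and v11 only
makes it selectable there (initdb --wal-segsize), so moving to larger
segments means re-initializing the cluster. A quick way to check what a
cluster currently uses:

-- Read-only preset; changing it requires a re-initdb (--wal-segsize in v11+).
SHOW wal_segment_size;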
>>>>>>
>>>>>>
>>>>>> Boris
>>>>>>
>>>>>> On Sun, Oct 21, 2018 at 9:06 AM, Boris Sagadin <boris(at)infosplet(dot)com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have a database running on an i3.8xlarge (256GB RAM, 32 CPU cores,
>>>>>>> 4x 1.9TB NVMe drives) AWS instance with about 5TB of disk space
>>>>>>> occupied, ext4, Ubuntu 16.04.
>>>>>>>
>>>>>>> Multi-tenant DB with about 40,000 tables, insert-heavy.
>>>>>>>
>>>>>>> I started a new slave with identical HW specs, using streaming
>>>>>>> replication. The DB started syncing from the master, which took about
>>>>>>> 4 hours, then it started applying the WALs. However, it seems it
>>>>>>> can't catch up. The delay is still around 3 hours (measured with
>>>>>>> now() - pg_last_xact_replay_timestamp()), even a day later. It goes
>>>>>>> up and down by a few hundred seconds, but seems to float around the
>>>>>>> 3h mark.
>>>>>>>
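It may also be worth graphing the byte backlog on the standby; comparing the
received and replayed LSNs separates network delay from replay speed. A small
sketch:

-- On the standby: WAL received but not yet replayed, in bytes.
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS replay_backlog_bytes;

A large, growing value here means the WAL is arriving fine and the single
recovery process simply can't apply it fast enough.
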
>>>>>>> Disk IO is low, at about 10% (measured with iostat), there are no
>>>>>>> connected clients, and the recovery process is at around 90%
>>>>>>> single-core CPU usage.
>>>>>>>
>>>>>>> Tried tuning various parameters, but to no avail. The only thing I
>>>>>>> found suspicious is that stracing the recovery process constantly
>>>>>>> produces many errors such as:
>>>>>>>
>>>>>>> lseek(428, 0, SEEK_END) = 780124160
>>>>>>> lseek(30, 0, SEEK_END) = 212992
>>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>> lseek(680, 0, SEEK_END) = 493117440
>>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>> lseek(774, 0, SEEK_END) = 583368704
>>>>>>>
>>>>>>> ...[snip]...
>>>>>>>
>>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>> lseek(774, 0, SEEK_END) = 583368704
>>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>> lseek(277, 0, SEEK_END) = 502882304
>>>>>>> lseek(6, 516096, SEEK_SET) = 516096
>>>>>>> read(6, "\227\320\5\0\1\0\0\0\0\340\7\246\26\274\0\0\315\0\0\0\0\0\0\0}\0178\5&/\260\r"..., 8192) = 8192
>>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>> lseek(735, 0, SEEK_END) = 272809984
>>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>> lseek(277, 0, SEEK_END) = 502882304
>>>>>>>
>>>>>>> ls -l fd/9
>>>>>>> lr-x------ 1 postgres postgres 64 Oct 21 06:21 fd/9 -> pipe:[46358]
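
(The EAGAIN reads on fd 9 should be harmless, by the way: it is a pipe, and a
non-blocking read on PostgreSQL's internal signalling pipe is expected to
return EAGAIN whenever the pipe is empty, so this looks like noise rather
than an error.)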
>>>>>>>
>>>>>>>
>>>>>>> Perf top on recovery produces:
>>>>>>>
>>>>>>> 27.76% postgres [.] pglz_decompress
>>>>>>> 9.90% [kernel] [k] entry_SYSCALL_64_after_swapgs
>>>>>>> 7.09% postgres [.] hash_search_with_hash_value
>>>>>>> 4.26% libpthread-2.23.so [.] llseek
>>>>>>> 3.64% libpthread-2.23.so [.] __read_nocancel
>>>>>>> 2.80% [kernel] [k] __fget_light
>>>>>>> 2.67% postgres [.] 0x000000000034d3ba
>>>>>>> 1.85% [kernel] [k] ext4_llseek
>>>>>>> 1.84% postgres [.] pg_comp_crc32c_sse42
>>>>>>> 1.44% postgres [.] hash_any
>>>>>>> 1.35% postgres [.] 0x000000000036afad
>>>>>>> 1.29% postgres [.] MarkBufferDirty
>>>>>>> 1.21% postgres [.] XLogReadRecord
>>>>>>> [...]
>>>>>>>
>>>>>>> Tried changing the process limits with prlimit to unlimited, but no
>>>>>>> change.
>>>>>>>
>>>>>>> I can turn off the WAL compression but I doubt this is the main
>>>>>>> culprit. Any ideas appreciated.
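
Given that pglz_decompress tops your profile at ~28%, wal_compression does
look like a real contributor here. It can be changed without a restart; a
sketch, assuming a superuser session on the master:

-- Stop compressing full-page images in WAL; applies on config reload.
ALTER SYSTEM SET wal_compression = off;
SELECT pg_reload_conf();

New WAL will then be written uncompressed; the standby still has to
decompress whatever compressed WAL is already queued, so the effect shows up
gradually.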
>>>>>>>
>>>>>>> Regards,
>>>>>>> Boris
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Cordially,
>>>>>
>>>>> Ing. Hellmuth I. Vargas S.
>>>>> Specialist in Telematics and Internet Business
>>>>> Oracle Database 10g Administrator Certified Associate
>>>>> EnterpriseDB Certified PostgreSQL 9.3 Associate
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Cordially,
>>>
>>> Ing. Hellmuth I. Vargas S.
>>>
>>>
>>>
>>
>> --
>> Cordially,
>>
>> Ing. Hellmuth I. Vargas S.
>>
>>
>>
>

--
Cordially,

Ing. Hellmuth I. Vargas S.
Specialist in Telematics and Internet Business
Oracle Database 10g Administrator Certified Associate
EnterpriseDB Certified PostgreSQL 9.3 Associate
