Quick Links

Re: Postgres 10, slave not catching up with master

From:	Hellmuth Vargas <hivs77(at)gmail(dot)com>
To:	boris(at)infosplet(dot)com
Cc:	pgsql-general(at)lists(dot)postgresql(dot)org
Subject:	Re: Postgres 10, slave not catching up with master
Date:	2018-10-23 13:10:10
Message-ID:	CAN3Qy4oWBijSyKyfaOpju=prGnCnsHNr5ddMCnMX4Kh4CHU5nA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

can share recovery.conf file settings??

El mar., 23 de oct. de 2018 a la(s) 00:28, Boris Sagadin (
boris(at)infosplet(dot)com) escribió:

> Yes, turning wal_compression off improves things. Slave that was mentioned
> unfortunately lagged too much before this setting was applied and was
> turned off. However the remaining slave lags less now, although still
> occasionally up to a few minutes. I think single threadedness of recovery
> is a big slowdown for write heavy databases. Maybe an option to increase
> wal_size beyond 16MB in v11 will help.
>
> In the meantime we'll solve this by splitting the DB to 2 or 3 clusters or
> maybe trying out some sharding solution like Citus.
>
>
> Boris
>
> On Sun, Oct 21, 2018 at 9:06 AM, Boris Sagadin <boris(at)infosplet(dot)com>
> wrote:
>
>> Hello,
>>
>> I have a database running on i3.8xlarge (256GB RAM, 32 CPU cores, 4x
>> 1.9TB NVMe drive) AWS instance with about 5TB of disk space occupied, ext4,
>> Ubuntu 16.04.
>>
>> Multi-tenant DB with about 40000 tables, insert heavy.
>>
>> I started a new slave with identical HW specs, SR. DB started syncing
>> from master, which took about 4 hours, then it started applying the WALs.
>> However, it seems it can't catch up. Delay is still around 3 hours
>> (measured with now() - pg_last_xact_replay_timestamp()), even a day later.
>> It goes a few 100s up and down, but it seems to float around 3h mark.
>>
>> Disk IO is low at about 10%, measured with iostat, no connected clients,
>> recovery process is at around 90% CPU single core usage.
>>
>> Tried tuning the various parameters, but with no avail. Only thing I
>> found suspicious is stracing the recovery process constantly produces many
>> errors such as:
>>
>> lseek(428, 0, SEEK_END) = 780124160
>> lseek(30, 0, SEEK_END) = 212992
>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily
>> unavailable)
>> lseek(680, 0, SEEK_END) = 493117440
>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily
>> unavailable)
>> lseek(774, 0, SEEK_END) = 583368704
>>
>> ...[snip]...
>>
>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily
>> unavailable)
>> lseek(774, 0, SEEK_END) = 583368704
>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily
>> unavailable)
>> lseek(277, 0, SEEK_END) = 502882304
>> lseek(6, 516096, SEEK_SET) = 516096
>> read(6,
>> "\227\320\5\0\1\0\0\0\0\340\7\246\26\274\0\0\315\0\0\0\0\0\0\0}\0178\5&/\260\r"...,
>> 8192) = 8192
>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily
>> unavailable)
>> lseek(735, 0, SEEK_END) = 272809984
>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily
>> unavailable)
>> lseek(277, 0, SEEK_END) = 502882304
>>
>> ls -l fd/9
>> lr-x------ 1 postgres postgres 64 Oct 21 06:21 fd/9 -> pipe:[46358]
>>
>>
>> Perf top on recovery produces:
>>
>> 27.76% postgres [.] pglz_decompress
>> 9.90% [kernel] [k] entry_SYSCALL_64_after_swapgs
>> 7.09% postgres [.] hash_search_with_hash_value
>> 4.26% libpthread-2.23.so [.] llseek
>> 3.64% libpthread-2.23.so [.] __read_nocancel
>> 2.80% [kernel] [k] __fget_light
>> 2.67% postgres [.] 0x000000000034d3ba
>> 1.85% [kernel] [k] ext4_llseek
>> 1.84% postgres [.] pg_comp_crc32c_sse42
>> 1.44% postgres [.] hash_any
>> 1.35% postgres [.] 0x000000000036afad
>> 1.29% postgres [.] MarkBufferDirty
>> 1.21% postgres [.] XLogReadRecord
>> [...]
>>
>> Tried changing the process limits with prlimit to unlimited, but no
>> change.
>>
>> I can turn off the WAL compression but I doubt this is the main culprit.
>> Any ideas appreciated.
>>
>> Regards,
>> Boris
>>
>>
>

--
Cordialmente,

Ing. Hellmuth I. Vargas S.
Esp. Telemática y Negocios por Internet
Oracle Database 10g Administrator Certified Associate
EnterpriseDB Certified PostgreSQL 9.3 Associate

In response to

Re: Postgres 10, slave not catching up with master at 2018-10-23 05:28:27 from Boris Sagadin

Responses

Re: Postgres 10, slave not catching up with master at 2018-10-23 16:28:54 from Boris Sagadin

Browse pgsql-general by date

	From	Date	Subject
Next Message	Daniel Fink (PDF)	2018-10-23 13:24:28	Enabling bdr in multiple databases on the same postgresql instance/cluster
Previous Message	Jeff Janes	2018-10-23 13:06:06	Re: Replication question