Re: Replication Lag Causes

From: Felipe Santos <felipepts(at)gmail(dot)com>
To: Mike Wilson <mfwilson(at)gmail(dot)com>
Cc: Greg Spiegelberg <gspiegelberg(at)gmail(dot)com>, pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Replication Lag Causes
Date: 2014-11-03 14:34:44
Message-ID: CAPYcRiXQhifiZwL+jY8+_QzkQY69QbNZcpW4PYe4X+GCxAQ4eg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

2014-11-02 19:16 GMT-02:00 Mike Wilson <mfwilson(at)gmail(dot)com>:

> Thanks for the information Greg.
>
> Unfortunately modifying the application stack this close to the holiday
> season won’t be an option so I’m left with:
> 1) Trying to optimize the settings I have for the query mix I have.
> 2) Optimize any long running DML queries (if any) to prevent lag due to
> locks.
> 3) Getting a better understanding of “what” causes lag.
>
> #3 will probably be central to at least minimizing lag during heavy DML
> load. If anyone has a good resource to describe when a slave would start
> to lag potentially that would help me hunt for the cause. I know long
> running DML on the master may cause lag but I’m uncertain as to the
> specifics of why. During periods of lag we do have more DML than usual
> running against the master but the queries themselves are very quick
> although there might be 20-30 DML operations per second against some of our
> central tables that store user account information. Even under heavy DML
> the queries still return in under a second. Possibly a large volume of of
> short running DML cause replication lag issues for large tables (~20M)?
>
> Thanks again for your help. BDR looks interesting but probably too
> cutting edge for my client.
>
> Mike Wilson
>
>
>
>
> On Nov 2, 2014, at 12:33 PM, Greg Spiegelberg <gspiegelberg(at)gmail(dot)com>
> wrote:
>
> Hi Mike,
>
> Sounds very familiar. Our master fans out to 16 slaves (cascading) and we
> had great success with segregating database queries to different slaves and
> some based on network latency. I'd suggest, if possible, alter the
> application to use the slave for simple SELECT's and FUNCTION's performing
> SELECT-like only work while limiting those applications and queries that
> perform DML to the master (obviously). If the load on the slave increases
> too much, spin up another slave. I'd mention from experience that it could
> be the load on the slave that is giving the appearance of replication lag.
> This is what led us to having (1) slave per application.
>
> There is also the BDR multi-master available in 9.4beta if you're wanting
> to live on the edge.
>
> -Greg
>
> On Sat, Nov 1, 2014 at 4:33 PM, Mike Wilson <mfwilson(at)gmail(dot)com> wrote:
>
>> I have two 9.3.4 PG instances that back a large internet website that has
>> very seasonal traffic and can generate large query loads. My instances are
>> in a master-slave streaming replication setup and are stable and in
>> general perform very well. The only issues we have with the boxes is that
>> when the master is busy the slave may start to lag excessively. I can give
>> specifics as to what heavily loaded means and additionally the
>> postgresql.conf for both boxes but my basic questions are:
>> * What causes streaming replication lag to increase?
>> * What parameters can be tuned to reduce streaming replication lag?
>> * Can a loaded slave affect lag adversely?
>> * Can increasing max_wal_senders help reduce lag?
>>
>> The reason I ask this is that as mentioned above the servers are stable
>> and are real troopers in general as they back a very popular web site that
>> puts the master under heavy seasonal load at times. At those times though
>> we see an almost exponential growth in streaming replication lag compared
>> to load on the master.
>>
>> For example, the master is a very beefy Solaris:
>> * 4 Recent Intel Zeons (16 physical cores)
>> * 256 GB of ECC RAM
>> * 12 TB of ZFS (spindle and SSD internal storage)
>> * DB on disk size is 2TB
>> * ZFS ARC cache of roughly 250G.
>> * ZFS ARC2/ZIL configured for SSD’s (this is awesome by the way)
>>
>> Basic PG Config:
>> shared_buffers = 2GB
>> work_mem = 128MB
>> max_connections = 1700 (supports roughly 100 web servers)
>> wal_keep_segments = 256 (roughly enough for 24 hours of operation
>> under heavy load)
>> wal_sender_timeout = 60s
>> replication_timeout=(not set)
>> wal_receiver_status_interval=10s
>> max_wal_senders=6
>> * wal archiving is off
>> * 98% of the queries on the master complete in under 500ms.
>> * No hung or very long running queries in general.
>>
>> The master on a normal day maintains a load of about 0.5, during which
>> replication lag to the slave is in hundreds milliseconds. When the
>> production db server is heavily hit though the load may go as high as 4 on
>> the master and the streaming replication lag may increase to more than 2
>> hours relatively quickly. Load on the slave is generally below 1 even when
>> the master is heavily loaded. The traffic to the master is primarily read
>> with about 10% DML (new users, purchase records, etc). DML statements
>> increase proportionally when under load though. The master and slave are
>> connected via dedicated 10G fiber link and even under heavy load the
>> utilization of the link is nowhere near close to saturation. BTW, the
>> slave does run some reported related queries throughout the day that might
>> take up to a minute to complete.
>>
>> I have the task of figuring out why this otherwise healthy DB starts to
>> lag so badly under load and if there is anything that we could do about
>> it. I’ve been wondering particularly if we should up the max_wal_senders
>> but from the docs it is unclear if that would help. In my testing with
>> pg_bench on our dev boxes which were the previous production hardware for
>> these servers I have determined that it doesn’t take much DML load on the
>> master to get the slave to start lagging severely. I was wondering if this
>> was expected and/or some design consideration? Possibly streaming
>> replication isn’t meant to be used for heavily hit databases and maintain
>> small lag times? I would like to believe that the fault is something we
>> have done though and that there is some parameter we could tune to reduce
>> this lag.
>>
>> Any recommendations would be very helpful.
>>
>> Mike Wilson
>> Predicate Logic Consulting
>>
>>
>>
>>
>
Hi Mike,

"Basically, the amount of DML is what really appears to cause the
replication to lag."

You said that you are using streaming replication, and streaming means
reading the logfiles and applying changes on the slave. Since DML is also
logged on the logfiles, maybe you are locking yourself by trying to read
and write to the same file(s).

You can try changing your replication type from streaming to Archive-WAL,
that way you will send archived WALs from the master to the slave, while
letting the "hot" WAL available for your DML operations only.

I've changed my replication type in the past and had no significant
increase in lag.

Cheers

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Andrew Dunstan 2014-11-03 20:00:14 assignment vs SELECT INTO
Previous Message Artūras Lapinskas 2014-11-03 01:40:22 Index order ignored after `is null` in query