Re: pg_basebackup blocking all queries with horrible performance

From:	Lonni J Friedman <netllama(at)gmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Jerry Sievers <gsievers19(at)comcast(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, pgsql-admin(at)postgresql(dot)org
Subject:	Re: pg_basebackup blocking all queries with horrible performance
Date:	2012-06-12 18:37:42
Message-ID:	CAP=oouGu=Pcdk3s4ceVZkdcpTQdt3LAiGo1ukkdYBibbc1+iWQ@mail.gmail.com
Lists:	pgsql-admin pgsql-hackers

On Tue, Jun 12, 2012 at 10:49 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Tue, Jun 12, 2012 at 2:37 AM, Lonni J Friedman <netllama(at)gmail(dot)com> wrote:
>> On Fri, Jun 8, 2012 at 7:29 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>> On Sat, Jun 9, 2012 at 4:30 AM, Lonni J Friedman <netllama(at)gmail(dot)com> wrote:
>>>> On Thu, Jun 7, 2012 at 11:04 PM, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au> wrote:
>>>>> On 06/08/2012 09:01 AM, Lonni J Friedman wrote:
>>>>>>
>>>>>> On Thu, Jun 7, 2012 at 5:07 PM, Jerry Sievers<gsievers19(at)comcast(dot)net>
>>>>>> wrote:
>>>>>>>
>>>>>>> You might try stopping pg_basebackup in place with SIGSTOP and check
>>>>>>> if problem goes away. SIGCONT and you should start having
>>>>>>> sluggishness again.
>>>>>>>
>>>>>>> If verified, then any sort of throttling mechanism should work.
>>>>>>
>>>>>>
>>>>>> I'm certain that the problem is triggered only when pg_basebackup is
>>>>>> running. It's very predictable, and goes away as soon as pg_basebackup
>>>>>> finishes running. What do you mean by a throttling mechanism?
>>>>>
>>>>>
>>>>> Sure, it only happens when pg_basebackup is running. But if you *pause*
>>>>> pg_basebackup, so it's still running but not currently doing work, does the
>>>>> problem go away? Does it come back when you unpause pg_basebackup? That's
>>>>> what Jerry was telling you to try.
>>>>>
>>>>> If the problem goes away when you pause pg_basebackup and comes back when
>>>>> you unpause it, it's probably a system load problem.
>>>>>
>>>>> If it doesn't go away, it's more likely to be a locking issue or something
>>>>> _other_ than simple load.
>>>>>
>>>>> SIGSTOP ("kill -STOP") pauses a process, and SIGCONT ("kill -CONT") resumes
>>>>> it, so on Linux you can use these to try and find out. When you SIGSTOP
>>>>> pg_basebackup then the postgres backend associated with it should block
>>>>> shortly afterwards as its buffers fill up and it can't send more data, so
>>>>> the load should come off the server.
>>>>>
>>>>> A "throttling mechanism" refers to anything that limits the rate or speed of
>>>>> a thing. In this case, what you want to do if your problem is system
>>>>> overload is to limit the speed at which pg_basebackup does its work so other
>>>>> things can still get work done. In other words you want to throttle it.
>>>>> Typical throttling mechanisms include the "ionice" and "renice" commands to
>>>>> change I/O and CPU priority, respectively.
>>>>>
>>>>> Note that you may need to change the priority of the *backend* that
>>>>> pg_basebackup is using, not necessarily the pg_basebackup command itself.
>>>>> I haven't done enough with Pg's replication to know how that works, so
>>>>> someone else will have to fill that bit in.
>>>>
>>>> Thanks for your reply. I've confirmed that issuing a SIGSTOP does
>>>> eliminate the thrashing, and issuing a SIGCONT resumes the thrash.
>>>>
>>>> I've looked at iostat output both before & during pg_basebackup runs,
>>>> and I'm not seeing any indication that the problem is due to disk IO
>>>> bottlenecks. The numbers don't vary very much at all between the good
>>>> & bad times. This is typical when pg_basebackup is running:
>>>> ########
>>>> Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>>>> md0        0.00    0.00  67.76  68.62   4.42   1.46     88.34      0.00   0.00     0.00     0.00   0.00   0.00
>>>> ########
>>>>
>>>> and this is when the system is ok:
>>>> ########
>>>> Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>>>> md0        0.00    0.00  68.04  68.56   4.44   1.46     88.39      0.00   0.00     0.00     0.00   0.00   0.00
>>>> ########
>>>>
>>>>
>>>> I looked at vmstat output, but nothing is jumping out at me as being
>>>> dramatically different when pg_basebackup is running. swap in and
>>>> swap out are zero 100% of the time for the good & bad perf cases. I
>>>> can post example output if someone is interested, or if there's
>>>> something specific that I should be looking at as a potential problem,
>>>> let me know.
>>>
>>> Did you set synchronous_standby_names to '*'? If so, the problem you
>>> encountered can happen.
>>>
>>> When synchronous_standby_names is '*', you cannot control which
>>> standby takes the role of synchronous standby. A standby that you
>>> expect to run as asynchronous might become the synchronous one. So
>>> my guess is that at first one of your three standbys was running as
>>> the synchronous standby, and all queries were executed normally. But
>>> when you started pg_basebackup, pg_basebackup unexpectedly
>>> took over the role of synchronous standby from another standby. Since
>>> pg_basebackup doesn't send information about replication
>>> progress back to the master, all queries (more precisely, transaction
>>> commits) got stuck, and kept waiting for a reply from the synchronous
>>> standby.
>>>
>>> You can avoid this problem by setting synchronous_standby_names
>>> to the names of your standbys instead of '*'.
>>
>> I don't have synchronous_standby_names set at all. I'm only doing
>> asynchronous replication.
>
> Hmm... I have no idea what happened in your environment, for now.
> Could you show me a self-contained test case?

I'm running the following, which gets piped over ssh to a remote
server (at gigabit ethernet speed):
pg_basebackup -v -D - -x -Ft -U postgres

One thing that I've discovered is that the load on the server correlates
directly with the rate at which data is piped to the remote server: when
I throttle the pipe back, the load drops accordingly.
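
For anyone wanting to reproduce the throttling experiment: the message does not say which tool was used to slow the pipe, so the following is only a minimal sketch of one way to do it, assuming a small Python filter placed between pg_basebackup and ssh. The function name throttled_copy, the 10 MB/s figure, and the chunk size are all hypothetical choices for illustration, not anything taken from this thread.

```python
import sys
import time


def throttled_copy(src, dst, max_bytes_per_sec, chunk_size=64 * 1024,
                   clock=time.monotonic, sleep=time.sleep):
    """Copy src to dst, pausing as needed so that the average transfer
    rate stays at or below max_bytes_per_sec. Returns bytes copied."""
    start = clock()
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
        # Earliest moment at which `total` bytes may have been sent
        # without exceeding the configured average rate.
        earliest = start + total / max_bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    dst.flush()
    return total


def main():
    # Hypothetical limit of 10 MB/s; adjust to taste.
    throttled_copy(sys.stdin.buffer, sys.stdout.buffer, 10 * 1024 * 1024)
```

Saved as a script (say, a hypothetical throttle.py that ends with a call to main()), this would sit in the middle of the pipeline, for example: pg_basebackup -v -D - -x -Ft -U postgres | python3 throttle.py | ssh <remote> '...'. Because the filter stops reading while it sleeps, the pipe fills up and the sending side blocks, which is the same effect the SIGSTOP test produced, only partially and continuously.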

In response to

Re: pg_basebackup blocking all queries with horrible performance at 2012-06-12 17:49:11 from Fujii Masao

Responses

Re: pg_basebackup blocking all queries with horrible performance at 2012-06-12 18:39:23 from Magnus Hagander
