From: Ants Aasma <ants(at)cybertec(dot)at>
To: Hannu Krosing <hannu(at)2ndquadrant(dot)com>
Cc: Sameer Thakur <samthakur74(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, sthomas(at)optionshouse(dot)com, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Samrat Revagade <revagade(dot)samrat(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: Inconsistent DB data in Streaming Replication
Date: 2013-04-11 15:09:32
Message-ID: CA+CSw_vJU0CjAYthQCAQ3ZpLXKhhJxprZw1M_XtTn0dzCnWR7A@mail.gmail.com
Lists: pgsql-hackers
On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing <hannu(at)2ndquadrant(dot)com> wrote:
> On 04/11/2013 03:52 PM, Ants Aasma wrote:
>>
>> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu(at)2ndquadrant(dot)com>
>> wrote:
>>>
>>> The proposed fix - halting all writes of data pages to disk and
>>> to WAL files while waiting ACK from standby - will tremendously
>>> slow down all parallel work on master.
>>
>> This is not what is being proposed. The proposed fix halts writes of
>> only those data pages that are modified within the window of WAL not
>> yet ACKed by the slave. This means pages that were recently modified
>> and that the clocksweep or checkpointer has decided to evict. This
>> only affects the checkpointer, bgwriter and backends doing allocation.
>> Furthermore, for the backend clocksweep case it would be reasonable to
>> just pick another buffer to evict. The slowdown in most actual cases
>> will be negligible.
>
> You also need to hold back all WAL writes, including the ones by
> parallel async and locally-synced transactions. Which means that
> you have to make all locally synced transactions wait on the
> syncrep transactions committed before them.
> After getting the ACK from the slave you then have a backlog of stuff
> to write locally, which then also needs to be sent to the slave.
> Basically this turns a nice smooth WAL write-and-stream pipeline into a
> chunky wait-and-write-and-wait-and-stream-and-wait :P
> This may not be a problem in light write load cases, which is
> probably the most common usecase for postgres, but it
> will harm top performance and also force people to get much
> better (and more expensive) hardware than would otherwise
> be needed.
Why would you need to hold back WAL writes? WAL is written on the master
first and then streamed to the slave as it is done now. You would only
need to hold back dirty page evictions whose LSN is recent enough to not
yet be replicated. This holding back is already done to wait for local WAL
flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied,
its usage count gets bumped, so it will not be considered for
eviction for at least one clocksweep cycle. In normal circumstances
that will be enough time to get an ACK from the slave. When WAL is
generated at a higher rate than can be replicated this will not be
true. In that case backends that need to bring in new pages will have
to wait for WAL to be replicated before they can continue. That will
hopefully include the backends that are doing the dirtying, throttling
the WAL generation rate. This would definitely be optional behavior,
not something turned on by default.
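
To make the shape of the change concrete, here is a rough sketch of how
the existing wait in bufmgr.c's FlushBuffer() could be extended.
FlushBuffer(), XLogFlush() and BufferGetLSN() are real, but the body is
heavily abridged, and WaitForStandbyAck() plus the GUC guarding it are
names I made up for illustration:

    /*
     * Sketch only, not a patch: FlushBuffer() already calls XLogFlush()
     * to enforce the WAL-before-data rule against local disk. The
     * proposal would optionally make it wait for the standby ACK too.
     */
    static void
    FlushBuffer(BufferDesc *buf, SMgrRelation reln)
    {
        XLogRecPtr  recptr = BufferGetLSN(buf);

        /* Existing rule: WAL covering this page must be on local disk. */
        XLogFlush(recptr);

        /*
         * Proposed optional rule: WAL covering this page must also have
         * been ACKed by the standby, so a failover can never find data
         * pages on the old master that are ahead of the WAL the standby
         * received. Both names below are hypothetical.
         */
        if (data_write_wait_for_standby)    /* hypothetical GUC */
            WaitForStandbyAck(recptr);      /* hypothetical helper */

        /* ... the actual smgrwrite() of the page follows as today ... */
    }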
>>
>>> And it does just turn the "master is ahead of slave" problem
>>> around into a "slave is ahead of master" problem :)
>>
>> The issue is not being ahead or behind. The issue is ensuring WAL
>> durability in the face of failovers before modifying data pages. This
>> is sufficient to guarantee no forks in the WAL stream from the point
>> of view of the data files and, with that, the ability to always
>> recover by replaying WAL.
>
> How would this handle the case Tom pointed out, namely a short
> power cycle on the master?
>
> Instead of just continuing after booting up again, the master now
> has to figure out if it had any slaves and then try to query them
> (for how long?) about whether they had replayed any WAL the master
> does not know of.
If the master is restarted and there is no failover to the slave, then
nothing strange would happen: the master does recovery, comes up and
starts streaming to the slave again. If there is a failover, then
whatever is managing the failover needs to ensure that the master does
not come up again on its own before it is reconfigured as a slave.
This is what HA cluster managers do.
> Suddenly the pure existence of streaming replica slaves has become
> a problem for the master!
>
> Won't this especially complicate the case of multiple slaves, each
> having received WAL to a slightly different LSN? And you do want
> to have at least 2 slaves if you want both durability
> and availability with syncrep.
>
> What if one of the slaves disconnects? How should the master react to this?
Again, WAL replication will be the same as it is now. Availability
considerations, including what to do when slaves go away, are the same
as for current sync replication. The only required change is that the
master can be configured to hold off on writing any data pages that
contain changes that might go missing in the case of a failover.
Whether the additional complexity is worth the feature is a matter of
opinion. As we have no patch yet I can't say that I know what all the
implications are, but at first glance the complexity seems rather
compartmentalized. This would only amend what the concept of a WAL
flush considers safely flushed.
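
As a sketch of what "safely flushed" would then mean: GetFlushRecPtr()
exists today, while GetStandbyAckPtr() is an assumed name for whatever
tracks the LSN the standby has ACKed:

    static XLogRecPtr
    GetSafeFlushPtr(void)
    {
        /* How far WAL is known flushed to local disk (existing function). */
        XLogRecPtr  local_flush = GetFlushRecPtr();

        /* How far the sync standby has ACKed WAL (hypothetical accessor). */
        XLogRecPtr  standby_ack = GetStandbyAckPtr();

        /*
         * A dirty data page may only be written out once its LSN is
         * behind both points, i.e. the WAL covering it is durable
         * locally and replicated.
         */
        return Min(local_flush, standby_ack);
    }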
Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de