From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> |
Cc: | Ants Aasma <ants(at)cybertec(dot)at>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Sameer Thakur <samthakur74(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, sthomas(at)optionshouse(dot)com, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Samrat Revagade <revagade(dot)samrat(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Inconsistent DB data in Streaming Replication |
Date: | 2013-04-12 10:59:45 |
Message-ID: | 20130412105945.GB5766@alap2.anarazel.de |
Lists: | pgsql-hackers |
On 2013-04-12 11:18:01 +0530, Pavan Deolasee wrote:
> On Thu, Apr 11, 2013 at 8:39 PM, Ants Aasma <ants(at)cybertec(dot)at> wrote:
>
> > On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing <hannu(at)2ndquadrant(dot)com>
> > wrote:
> > > On 04/11/2013 03:52 PM, Ants Aasma wrote:
> > >>
> > >> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu(at)2ndquadrant(dot)com>
> > >> wrote:
> > >>>
> > >>> The proposed fix - halting all writes of data pages to disk and
> > >>> to WAL files while waiting for an ACK from standby - will tremendously
> > >>> slow down all parallel work on master.
> > >>
> > >> This is not what is being proposed. The proposed fix halts writes of
> > >> only data pages that are modified within the window of WAL that is not
> > >> yet ACKed by the slave. This means pages that were recently modified
> > >> and where the clocksweep or checkpoint has decided to evict them. This
> > >> only affects the checkpointer, bgwriter and backends doing allocation.
> > >> Furthermore, for the backend clocksweep case it would be reasonable to
> > >> just pick another buffer to evict. The slowdown for most actual cases
> > >> will be negligible.
> > >
> > > You also need to hold back all WAL writes, including the ones by
> > > parallel async and locally-synced transactions. Which means that
> > > you have to make all locally synced transactions wait on the
> > > syncrep transactions committed before them.
> > > After getting the ACK from slave you then have a backlog of stuff
> > > to write locally, which then also needs to be sent to slave. Basically
> > > this turns a nice smooth WAL write-and-stream pipeline into a
> > > chunky wait-and-write-and-wait-and-stream-and-wait :P
> > > This may not be a problem in slight write load cases, which is
> > > probably the most widely happening usecase for postgres, but it
> > > will harm top performance and also force people to get much
> > > better (and more expensive) hardware than would otherwise
> > > be needed.
> >
> > Why would you need to hold back WAL writes? WAL is written on master
> > first and then streamed to slave as it is done now. You would only need
> > to hold back dirty page evictions having a recent enough LSN to not yet
> > be replicated. This holding back is already done to wait for local WAL
> > flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied
> > its usage count gets bumped, so it will not be considered for
> > eviction for at least one clocksweep cycle. In normal circumstances
> > that will be enough time to get an ACK from the slave. When WAL is
> > generated at a higher rate than can be replicated this will not be
> > true. In that case backends that need to bring in new pages will have
> > to wait for WAL to be replicated before they can continue. That will
> > hopefully include the backends that are doing the dirtying, throttling
> > the WAL generation rate. This would definitely be optional behavior,
> > not something turned on by default.
> >
> >
> I agree. I don't think the proposed change would cause much of a performance
> bottleneck since the proposal is to hold back writing of dirty pages until
> the WAL is replicated successfully to the standby. The heap pages are
> mostly written by the background threads often much later than the WAL for
> the change is written. So in all likelihood, there will be no wait
> involved. Of course, this will not be true for very frequently updated
> pages that must be written at a checkpoint.
I don't think that holds true at all. If you look at pg_stat_bgwriter in
any remotely busy cluster with a hot data set larger than shared_buffers,
you'll notice that a large percentage of writes will have been done by
backends themselves.
Yes, we need to improve on this, and we are talking about it right now
in another thread, but until that's solved this argument seems to
fall flat on its face.
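To put a number on the pg_stat_bgwriter point, here is a minimal sketch of
the arithmetic, assuming the three counters buffers_checkpoint,
buffers_clean and buffers_backend have been read from that view. The
function name and the sample values are hypothetical and only illustrate
the calculation; they are not measurements from any cluster.

```python
def backend_write_share(buffers_checkpoint, buffers_clean, buffers_backend):
    """Fraction of buffer writes issued by ordinary backends.

    The arguments mirror the pg_stat_bgwriter counters of the same names:
    buffers written by checkpoints, by the background writer, and by
    backends themselves.
    """
    total = buffers_checkpoint + buffers_clean + buffers_backend
    if total == 0:
        return 0.0  # no writes recorded since the last stats reset
    return buffers_backend / total


# Hypothetical counter values, not taken from any real cluster.
share = backend_write_share(
    buffers_checkpoint=1000, buffers_clean=500, buffers_backend=1500
)
print(f"backend share of buffer writes: {share:.0%}")
```

A high share here would mean backends are evicting dirty pages themselves,
which is the case where holding back those writes makes the backend wait.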
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services