Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>
Subject: Re: Improve WALRead() to suck data directly from WAL buffers when possible
Date: 2023-01-25 21:15:40
Message-ID: 20230125211540.zylu74dj2uuh3k7w@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-01-14 12:34:03 -0800, Andres Freund wrote:
> On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:
> > On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
> > > Please review the attached v2 patch further.
> >
> > I'm still unclear on the performance goals of this patch. I see that it
> > will reduce syscalls, which sounds good, but to what end?
> >
> > Does it allow a greater number of walsenders? Lower replication
> > latency? Less IO bandwidth? All of the above?
>
> One benefit would be that it'd make it more realistic to use direct IO for WAL
> - for which I have seen significant performance benefits. But when we
> afterwards have to re-read it from disk to replicate, it's less clearly a win.

Satya's email just now reminded me of another important reason:

Eventually we should add the ability to stream out WAL *before* it has locally
been written out and flushed. Obviously the relevant positions would have to
be noted in the relevant message in the streaming protocol, and we couldn't
generally allow standbys to apply that data yet.

That'd allow us to significantly reduce the overhead of synchronous
replication, because instead of commonly needing to send out all the pending
WAL at commit, we'd just need to send out the updated flush position. The
reason this would lower the overhead is that:

a) The reduced amount of data to be transferred reduces latency - it's easy to
accumulate a few TCP packets worth of data even in a single small OLTP
transaction
b) The remote side can start to write out data earlier

Of course this would require additional infrastructure on the receiver
side. E.g. some persistent state indicating up to where WAL is allowed to be
applied, to avoid the standby getting ahead of the primary, in case the
primary crash-restarts (or has more severe issues).
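
A rough sketch of that receiver-side bookkeeping, as a toy Python model rather
than PostgreSQL code. All names here (StandbyWalState, apply_limit, and so on)
are hypothetical, LSNs are modeled as plain byte offsets, and the state is kept
in memory only, whereas the real thing would have to persist the apply limit:

```python
from dataclasses import dataclass


@dataclass
class StandbyWalState:
    """Hypothetical standby-side bookkeeping; not an existing PostgreSQL structure."""
    received_upto: int = 0  # end of the WAL received so far (LSN as a byte offset)
    apply_limit: int = 0    # highest LSN the primary has reported as flushed locally

    def on_wal_data(self, start_lsn: int, payload: bytes, primary_flush_lsn: int) -> None:
        # WAL may arrive before the primary has flushed it, so each message
        # is assumed to also carry the primary's current flush position.
        if start_lsn != self.received_upto:
            raise ValueError("non-contiguous WAL")
        self.received_upto = start_lsn + len(payload)
        self.on_flush_update(primary_flush_lsn)

    def on_flush_update(self, primary_flush_lsn: int) -> None:
        # The apply limit only ever moves forward.
        self.apply_limit = max(self.apply_limit, primary_flush_lsn)

    def replayable_upto(self) -> int:
        # Never apply WAL the primary could still lose in a crash-restart.
        return min(self.received_upto, self.apply_limit)
```

In this model, a commit on the primary only has to trigger a small
on_flush_update() message, since the WAL bytes themselves were already sent.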

With a bit of work we could perform WAL replay on the standby without waiting
for the fdatasync of the received WAL - that only needs to happen a) when we
need to confirm a flush position to the primary and b) when we need to write
back pages from the buffer pool (and some other things).
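
As an illustration only, the two cases named above can be written as a small
predicate. This is a toy Python model with hypothetical names, it does not
cover the "other things", and it assumes the usual write-ahead rule that a
page may only be written back once the WAL up to its LSN is durable:

```python
from typing import Optional


def must_sync_received_wal(synced_upto: int,
                           confirm_upto: Optional[int] = None,
                           writeback_page_lsn: Optional[int] = None) -> bool:
    """Return True if received WAL must be fdatasync'ed before proceeding.

    synced_upto:        LSN up to which received WAL is already durable
    confirm_upto:       flush position about to be confirmed to the primary
    writeback_page_lsn: LSN of a dirty page about to be written back
    """
    # a) We must not confirm a flush position that is not yet durable.
    if confirm_upto is not None and confirm_upto > synced_upto:
        return True
    # b) A page must not reach disk before the WAL that last modified it.
    if writeback_page_lsn is not None and writeback_page_lsn > synced_upto:
        return True
    # Plain replay does not force a sync in this model.
    return False
```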

Greetings,

Andres Freund
