From: | Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
Subject: | Re: Improve WALRead() to suck data directly from WAL buffers when possible |
Date: | 2023-11-03 14:53:30 |
Message-ID: | CALj2ACUXN42H=_Jzw4oYZxoW0gyx82gALJ7vicz9UOuLVRbjnw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, Nov 3, 2023 at 12:35 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Thu, 2023-11-02 at 22:38 +0530, Bharath Rupireddy wrote:
> > > I suppose the question is: should reading from the WAL buffers an
> > > intentional thing that the caller does explicitly by specific
> > > callers?
> > > Or is it an optimization that should be hidden from the caller?
> > >
> > > I tend toward the former, at least for now.
> >
> > Yes, it's an optimization that must be hidden from the caller.
>
> As I said, I tend toward the opposite: that specific callers should
> read from the buffers explicitly in the cases where it makes sense.
How about adding a bool flag (read_from_wal_buffers) to
XLogReaderState so that the callers can set it if they want this
facility via XLogReaderAllocate()?
> > At any given point of time, WAL buffer pages are maintained as a
> > circularly sorted array in an ascending order from
> > OldestInitializedPage to InitializedUpTo (new pages are inserted at
> > this end).
>
> I don't see any reference to OldestInitializedPage or anything like it,
> with or without your patch. Am I missing something?
OldestInitializedPage is introduced in v14-0001 patch. Please have a look.
> > - Read 6 pages starting from LSN 80. Nothing is read from WAL buffers
> > as the page at LSN 80 doesn't exist despite other 5 pages starting
> > from LSN 90 exist.
>
> This is what I imagined a "partial hit" was: read the 5 pages starting
> at 90. The caller would then need to figure out how to read the page at
> LSN 80 from the segment files.
>
> I am not saying we should support this case; perhaps it doesn't matter.
> I'm just describing why that term was confusing for me.
Okay. Current patch doesn't support this case.
> > If a caller asks for an unflushed WAL read
> > intentionally or unintentionally, XLogReadFromBuffers() reads only 4
> > pages starting from LSN 150 to LSN 180 and will leave the remaining 2
> > pages for the caller to deal with. This is the partial hit that can
> > happen.
>
> To me that's more like an EOF case. "Partial hit" sounds to me like the
> data exists but is not available in the cache (i.e. go to the segment
> files); whereas if it encountered the end, the data is not available at
> all.
Right. We can tweak the comments around "partial hit" if required.
> > WALBufMappingLock protects both xlblocks and WAL buffer pages [1][2].
> > I'm not sure how using the memory barrier, not WALBufMappingLock,
> > prevents writers from replacing WAL buffer pages while readers
> > reading
> > the pages.
>
> It doesn't *prevent* that case, but it does *detect* that case. We
> don't want to prevent writers from replacing WAL buffers, because that
> would mean we are slowing down the critical WAL writing path.
>
> Let me explain the potential problem cases, and how the barriers
> prevent them:
>
> Potential problem 1: the page is not yet resident in the cache at the
> time the memcpy begins. In this case, the first read barrier would
> ensure that the page is also not yet resident at the time xlblocks[idx]
> is read into endptr1, and we'd break out of the loop.
>
> Potential problem 2: the page is evicted before the memcpy finishes. In
> this case, the second read barrier would ensure that the page was also
> evicted before xlblocks[idx] is read into endptr2, and again we'd
> detect the problem and break out of the loop.
Understood.
> I assume here that, if xlblocks[idx] holds the endPtr of the desired
> page, all of the bytes for that page are resident at that moment. I
> don't think that's true right now: AdvanceXLInsertBuffers() zeroes the
> old page before updating xlblocks[nextidx].
Right.
> I think it needs something like:
>
> pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], InvalidXLogRecPtr);
> pg_write_barrier();
>
> before the MemSet.
I think it works. First, xlblocks needs to be turned to an array of
64-bit atomics and then the above change. With this, all those who
reads xlblocks with or without WALBufMappingLock also need to check if
xlblocks[idx] is ever InvalidXLogRecPtr and take appropriate action.
I'm sure you have seen the following. It looks like I'm leaning
towards the claim that it's safe to read xlblocks without
WALBufMappingLock. I'll put up a patch for these changes separately.
/*
* Make sure the initialization of the page becomes visible to others
* before the xlblocks update. GetXLogBuffer() reads xlblocks without
* holding a lock.
*/
pg_write_barrier();
*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
I think the 3 things that helps read from WAL buffers without
WALBufMappingLock are: 1) couple of the read barriers in
XLogReadFromBuffers, 2) atomically initializing xlblocks[idx] to
InvalidXLogRecPtr plus a write barrier in AdvanceXLInsertBuffer(), 3)
the following sanity check to see if the read page is valid in
XLogReadFromBuffers(). If it sounds sensible, I'll work towards coding
it up. Thoughts?
+ , we
+ * need to ensure that we are not reading a page that just got
+ * initialized. For this, we look at the needed page header.
+ */
+ phdr = (XLogPageHeader) page;
+
+ /* Return, if WAL buffer page doesn't look valid. */
+ if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+ phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+ phdr->xlp_tli == tli))
+ break;
+
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From | Date | Subject | |
---|---|---|---|
Next Message | Nikita Malakhov | 2023-11-03 15:20:03 | Re: remaining sql/json patches |
Previous Message | Tom Lane | 2023-11-03 14:35:44 | Re: Regression on pg_restore to 16.0: DOMAIN not available to SQL function |