Re: [Patch] Optimize dropping of relation buffers using dlist

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: k(dot)jamison(at)fujitsu(dot)com
Cc: tsunakawa(dot)takay(at)fujitsu(dot)com, amit(dot)kapila16(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, andres(at)anarazel(dot)de, robertmhaas(at)gmail(dot)com, tomas(dot)vondra(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [Patch] Optimize dropping of relation buffers using dlist
Date: 2020-10-02 02:44:46
Message-ID: 20201002.114446.795674172958545821.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Thu, 1 Oct 2020 12:55:34 +0000, "k(dot)jamison(at)fujitsu(dot)com" <k(dot)jamison(at)fujitsu(dot)com> wrote in
> On Thursday, October 1, 2020 4:52 PM, Tsunakawa-san wrote:
>
> > From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
> > > I thought that the advantage of this optimization is that we don't
> > > need to visit all buffers?  If we need to run a full scan for any
> > > reason, there's no point in looking up already-visited buffers again.
> > > That's just wasteful cycles.  Am I missing something?
> > >
> > > I don't understand.  If we chose the optimized dropping, the reason
> > > is that the number of buffer look-ups is below a certain threshold.
> > > Why do you think that the fork a buffer belongs to is relevant to
> > > the criteria?
> >
> > I rethought about this, and you certainly have a point, but... OK, I think I
> > understood.  I had been thinking about it in an overly complicated way.  In
> > other words, you're suggesting "Let's simply treat all forks as one relation
> > to determine whether to optimize," right?  That is, the code simply becomes:

Exactly.  The concept of the threshold is that if we expect more
buffer look-ups than that, we consider a one-time full scan more
efficient.  Since we know we are going to drop the buffers of all (or
the specified) forks of the relation at once, the number of look-ups
is naturally the sum of the expected numbers of buffers over all
forks.

> > Sums up the number of buffers to invalidate in all forks;
> > if (the cached sizes of all forks are valid &&
> >     # of buffers to invalidate < THRESHOLD)
> > {
> >     do the optimized way;
> >     return;
> > }
> > do the traditional way;
> >
> > This will be simple, and I'm +1.

Thanks!
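
To make the agreed shape concrete, a rough sketch in C could look
like the following.  (BUF_DROP_FULL_SCAN_THRESHOLD, smgr_reln and the
local variable names are placeholders, not the actual patch code.)

===
BlockNumber nBlocksToInvalidate = 0;
int         i;

/* Sum up the number of blocks to drop over all forks. */
for (i = 0; i < nforks; i++)
{
    BlockNumber nblocks = smgrcachednblocks(smgr_reln, forkNum[i]);

    /* an uncached fork size forces the full buffer-pool scan */
    if (nblocks == InvalidBlockNumber)
        break;

    nBlocksToInvalidate += (nblocks - firstDelBlock[i]);
}

if (i == nforks && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
{
    /* drop buffers by individual look-ups in the buffer mapping table */
}
else
{
    /* fall back to scanning the whole buffer pool */
}
===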

> This is actually close to the v18 I posted trying Horiguchi-san's approach, but
> that patch had a bug.  So attached is an updated version (v20) trying this
> approach again.  I hope it's bug-free this time.

Thanks for the new version.

- * XXX currently it sequentially searches the buffer pool, should be
- * changed to more clever ways of searching. However, this routine
- * is used only in code paths that aren't very performance-critical,
- * and we shouldn't slow down the hot paths to make it faster ...
+ * XXX The relation might have extended before this, so this path is

The following description is found in the comment for FlushRelationBuffers.

> * XXX currently it sequentially searches the buffer pool, should be
> * changed to more clever ways of searching. This routine is not
> * used in any performance-critical code paths, so it's not worth
> * adding additional overhead to normal paths to make it go faster;
> * but see also DropRelFileNodeBuffers.

This reads to me as "We won't do that kind of optimization for
FlushRelationBuffers, but DropRelFileNodeBuffers would need it".  If
so, don't we need to revise that comment as well?
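
For example, the last sentence might be updated to something like the
following (the wording is just a suggestion):

===
 * XXX currently it sequentially searches the buffer pool, should be
 * changed to more clever ways of searching.  This routine is not
 * used in any performance-critical code paths, so it's not worth
 * adding additional overhead to normal paths to make it go faster;
 * but see DropRelFileNodeBuffers, which now avoids the full scan
 * during recovery.
===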

- * XXX currently it sequentially searches the buffer pool, should be
- * changed to more clever ways of searching. However, this routine
- * is used only in code paths that aren't very performance-critical,
- * and we shouldn't slow down the hot paths to make it faster ...
+ * XXX The relation might have extended before this, so this path is
+ * only optimized during recovery when we can get a reliable cached
+ * value of blocks for specified relation. In addition, it is safe to
+ * do this since there are no other processes but the startup process
+ * that changes the relation size during recovery. Otherwise, or if
+ * not in recovery, proceed to usual invalidation process, where it
+ * sequentially searches the buffer pool.

This should no longer be an XXX comment.  It also seems to me too
detailed for this function's level.  How about something like the
following?  (except for its syntax, or phrasing :p)

===
If the expected maximum number of buffers to drop is small enough
compared to NBuffers, individual buffers are located by
BufTableLookup.  Otherwise we scan through all buffers.  Since we
mustn't leave a buffer behind, we take the latter way unless the
number can be reliably identified.  See smgrcachednblocks() for
details.
===

(I'm still mildly opposed to the function name, which seems to expose
too much detail.)

+ * Get the total number of cached blocks and to-be-invalidated blocks
+ * of the relation. If a fork's nblocks is not valid, break the loop.

The number of file blocks is not usually equal to the number of
existing buffers for the file. We might need to explain that
limitation here.
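
For example, the comment could spell that limitation out (wording is
only a suggestion):

===
 * Get the total number of cached blocks and to-be-invalidated blocks
 * of the relation.  Note that the cached file size is only an upper
 * bound; any number of those blocks, from none to all, may actually
 * reside in shared buffers.  If a fork's nblocks is not valid, break
 * the loop.
===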

+ for (j = 0; j < nforks; j++)

Though I understand that j is meant to be associated with the fork
number, I'm a bit uncomfortable with j being used for the outermost
loop.

+ for (curBlock = firstDelBlock[j]; curBlock < nTotalBlocks; curBlock++)

Mmm. We should compare curBlock with the number of blocks of the fork,
not the total across all forks.
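
That is, something like the following, where forkNblocks[] stands for
whatever per-fork count the patch keeps around (a sketch, not tested):

===
for (j = 0; j < nforks; j++)
{
    /* visit only the to-be-dropped block range of this fork */
    for (curBlock = firstDelBlock[j]; curBlock < forkNblocks[j]; curBlock++)
    {
        /*
         * look up and invalidate the buffer for
         * (rnode.node, forkNum[j], curBlock), if any
         */
    }
}
===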

+ uint32 newHash; /* hash value for newTag */
+ BufferTag newTag; /* identity of requested block */
+ LWLock *newPartitionLock; /* buffer partition lock for it */

It seems to have been copied from somewhere, but the buffer is not new
at all.
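
Renaming along these lines might read better (names are only a
suggestion):

===
uint32      bufHash;            /* hash value for the buffer tag */
BufferTag   bufTag;             /* identity of the block being dropped */
LWLock     *bufPartitionLock;   /* buffer partition lock for it */
===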

+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
+ bufHdr->tag.forkNum == forkNum[j] &&
+ bufHdr->tag.blockNum == curBlock)
+ InvalidateBuffer(bufHdr); /* releases spinlock */

I think it cannot happen that the buffer is used for a different block
of the same relation fork, but it would be safer to check
bufHdr->tag.blockNum >= firstDelBlock[j] instead.
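
That is (a sketch of the suggested condition):

===
if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
    bufHdr->tag.forkNum == forkNum[j] &&
    bufHdr->tag.blockNum >= firstDelBlock[j])
    InvalidateBuffer(bufHdr);   /* releases spinlock */
===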

+/*
+ * smgrcachednblocks() -- Calculate the number of blocks that are cached in
+ * the supplied relation.
+ *
+ * It is equivalent to calling smgrnblocks, but only used in recovery for now
+ * when DropRelFileNodeBuffers() is called. This ensures that only cached value
+ * is used which is always valid in recovery, since there is no shared
+ * invalidation mechanism that is implemented yet for changes in file size.
+ *
+ * This returns an InvalidBlockNumber when smgr_cached_nblocks is not available
+ * and when not in recovery.

Isn't it too concrete?  We need to mention the buggy-kernel issue here
rather than the details of the callers.

And if the comment is correct, we should Assert(InRecovery) at the
beginning of this function.
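
That is, something like this (the body is just a sketch, assuming the
smgr_cached_nblocks field mentioned above):

===
BlockNumber
smgrcachednblocks(SMgrRelation reln, ForkNumber forknum)
{
    /* the cached value is reliable only during recovery */
    Assert(InRecovery);

    return reln->smgr_cached_nblocks[forknum];
}
===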

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
