Re: [Patch] Optimize dropping of relation buffers using dlist

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: tsunakawa(dot)takay(at)fujitsu(dot)com
Cc: thomas(dot)munro(at)gmail(dot)com, k(dot)jamison(at)fujitsu(dot)com, amit(dot)kapila16(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, andres(at)anarazel(dot)de, robertmhaas(at)gmail(dot)com, tomas(dot)vondra(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [Patch] Optimize dropping of relation buffers using dlist
Date: 2020-10-22 08:50:36
Message-ID: 20201022.175036.529997219363894066.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Thu, 22 Oct 2020 07:31:55 +0000, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com> wrote in
> From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
> > On Thu, Oct 22, 2020 at 7:33 PM Kyotaro Horiguchi
> > <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > > Mmm, not exactly. The requirement here is that we must be certain
> > > that we don't have a buffer for blocks beyond the file size known to
> > > the process. During recovery, if the first lseek() returned a size
> > > smaller than the actual one, we cannot have a buffer for the blocks
> > > beyond that size. After we truncate or extend the file, we are
> > > certain that we don't have a buffer for unknown blocks.
> >
> > Thanks, I understand now. Something feels fragile about it, perhaps
> > because it's not really acting as a "cache" anymore despite its name,
> > but I see the logic now. It becomes the authoritative source of
> > information, even if the kernel decides to make our file smaller
> > asynchronously.
>
> Thank you, Horiguchi-san, you are a savior! I was worried as if the end of the world had come.
>
>
> > I think a synchronised file size cache wouldn't be enough to use this
> > trick outside the recovery process, because the initial value would
> > come from a call to lseek(), but unlike recovery, that wouldn't happen
> > *before* we start putting pages in the buffer pool. Also, if we one
> > day have a size-limited relcache, even recovery could get into
> > trouble, if it evicts the RelationData that holds the authoritative
> > nblocks value.
>
> That's too bad, because we hoped we might be able to optimize various operations during normal operation (TRUNCATE, DROP TABLE/INDEX, DROP DATABASE, etc.). If an honest man can't trust a system call, that's hell.
>
> I'm probably being silly, but can't we avoid the problem by using fstat() instead of lseek(SEEK_END)? Would they return the same value from the i-node?
>
> Or, can't we just try to do BufTableLookup() one block after what smgrnblocks() returns?

A lossy smgr relation cache or relcache is not a hard obstacle. As in
the !accurate case, we just give up the optimized dropping if the
relcache doesn't give the authoritative size.
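
Concretely, I'm thinking of something along these lines (illustrative
names and threshold only, not the patch as posted):

    bool        accurate = false;
    BlockNumber nblocks = smgrnblocks(reln, forknum, &accurate);

    if (accurate && (nblocks - firstDelBlock) <= BUF_DROP_THRESHOLD)
    {
        BlockNumber blk;

        /* We cannot have buffers at or beyond nblocks, so probe per block. */
        for (blk = firstDelBlock; blk < nblocks; blk++)
            InvalidateRelationBuffer(reln->smgr_rnode.node, forknum, blk);  /* placeholder */
    }
    else
    {
        /* Lossy cache / !accurate: give up and scan the whole buffer pool. */
        DropRelationBuffersFullScan(reln->smgr_rnode.node, forknum, firstDelBlock);  /* placeholder */
    }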

By the way, a heap scan finds the size of the target relation using
smgrnblocks(). I'm not sure why we don't miss recently extended pages
during a heap scan. It seems possible that a concurrent checkpoint
fsyncs the relation file between the extension and the scan, and the
scan then sees a smaller size than the file really is.
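
To spell out the sequence I'm worried about (a rough paraphrase of the
heap-scan setup path, not verbatim source):

    /* initscan() caches the relation size once, up front: */
    scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_base.rs_rd);
        /* -> RelationGetNumberOfBlocksInFork(rel, MAIN_FORKNUM)
         * -> smgrnblocks() -> lseek(fd, 0, SEEK_END) on the last segment
         */

    /*
     * If another backend extends the relation, a checkpoint fsyncs the
     * file in between, and lseek() then under-reports the size, this
     * scan would silently stop short of the newly added blocks.
     */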

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
