Re: [Patch] Optimize dropping of relation buffers using dlist

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "k(dot)jamison(at)fujitsu(dot)com" <k(dot)jamison(at)fujitsu(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [Patch] Optimize dropping of relation buffers using dlist
Date: 2020-08-07 03:37:55
Message-ID: CAA4eK1LP76Uqpk7RfUjVyL_7cO9Rn=aahMQxB=JuMS_82J5NwQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Aug 6, 2020 at 6:53 AM k(dot)jamison(at)fujitsu(dot)com
<k(dot)jamison(at)fujitsu(dot)com> wrote:
>
> On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote:
>
> Hi,
> Thank you for your constructive review and comments.
> Sorry for the late reply.
>
> > Hi,
> >
> > On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
> > > Andres Freund <andres(at)anarazel(dot)de> writes:
> > > > Indeed. The buffer mapping hashtable already is visible as a major
> > > > bottleneck in a number of workloads. Even in readonly pgbench if s_b
> > > > is large enough (so the hashtable is larger than the cache). Not to
> > > > speak of things like a cached sequential scan with a cheap qual and wide
> > rows.
> > >
> > > To be fair, the added overhead is in buffer allocation not buffer lookup.
> > > So it shouldn't add cost to fully-cached cases. As Tomas noted
> > > upthread, the potential trouble spot is where the working set is
> > > bigger than shared buffers but still fits in RAM (so there's no actual
> > > I/O needed, but we do still have to shuffle buffers a lot).
> >
> > Oh, right, not sure what I was thinking.
> >
> >
> > > > Wonder if the temporary fix is just to do explicit hashtable probes
> > > > for all pages iff the size of the relation is < s_b / 500 or so.
> > > > That'll address the case where small tables are frequently dropped -
> > > > and dropping large relations is more expensive from the OS and data
> > > > loading perspective, so it's not gonna happen as often.
> > >
> > > Oooh, interesting idea. We'd need a reliable idea of how long the
> > > relation had been (preferably without adding an lseek call), but maybe
> > > that's do-able.
> >
> > IIRC we already do smgrnblocks nearby, when doing the truncation (to figure out
> > which segments we need to remove). Perhaps we can arrange to combine the
> > two? The layering probably makes that somewhat ugly :(
> >
> > We could also just use pg_class.relpages. It'll probably mostly be accurate
> > enough?
> >
> > Or we could just cache the result of the last smgrnblocks call...
> >
> >
> > One of the cases where this type of strategy is most intersting to me is the partial
> > truncations that autovacuum does... There we even know the range of tables
> > ahead of time.
>
> Konstantin tested it on various workloads and saw no regression.
> But I understand the sentiment on the added overhead on BufferAlloc.
> Regarding the case where the patch would potentially affect workloads that fit into
> RAM but not into shared buffers, could one of Andres' suggested idea/s above address
> that, in addition to this patch's possible shared invalidation fix? Could that settle
> the added overhead in BufferAlloc() as temporary fix?
>

Yes, I think so. Because as far as I can understand he is suggesting
to do changes only in the Truncate/Vacuum code path. Basically, I
think you need to change DropRelFileNodeBuffers or similar functions.
There shouldn't be any change in the BufferAlloc or the common code
path, so there is no question of regression in such cases. I am not
sure what you have in mind for this but feel free to clarify your
understanding before implementation.

> Thomas Munro is also working on caching relation sizes [1], maybe that way we
> could get the latest known relation size. Currently, it's possible only during
> recovery in smgrnblocks.
>
> Tom Lane wrote:
> > But aside from that, I noted a number of things I didn't like a bit:
> >
> > * The amount of new shared memory this needs seems several orders of
> > magnitude higher than what I'd call acceptable: according to my measurements
> > it's over 10KB per shared buffer! Most of that is going into the
> > CachedBufTableLock data structure, which seems fundamentally misdesigned ---
> > how could we be needing a lock per map partition *per buffer*? For comparison,
> > the space used by buf_table.c is about 56 bytes per shared buffer; I think this
> > needs to stay at least within hailing distance of there.
> >
> > * It is fairly suspicious that the new data structure is manipulated while holding
> > per-partition locks for the existing buffer hashtable.
> > At best that seems bad for concurrency, and at worst it could result in deadlocks,
> > because I doubt we can assume that the new hash table has partition boundaries
> > identical to the old one.
> >
> > * More generally, it seems like really poor design that this has been written
> > completely independently of the existing buffer hash table.
> > Can't we get any benefit by merging them somehow?
>
> The original aim is to just shorten the recovery process, and eventually the speedup
> on both vacuum and truncate process are just added bonus.
> Given that we don't have a shared invalidation mechanism in place yet like radix tree
> buffer mapping which is complex, I thought a patch like mine could be an alternative
> approach to that. So I want to improve the patch further.
> I hope you can help me clarify the direction, so that I can avoid going farther away
> from what the community wants.
> 1. Both normal operations and recovery process
> 2. Improve recovery process only
>

I feel Andres's suggestion will help in both cases.

> > I wonder if you have considered case of local hash (maintained only during recovery)?
> > If there is after-crash recovery, then there will be no concurrent
> > access to shared buffers and this hash will be up-to-date.
> > in case of hot-standby replica we can use some simple invalidation (just
> > one flag or counter which indicates that buffer cache was updated).
> > This hash also can be constructed on demand when DropRelFileNodeBuffers
> > is called first time (so w have to scan all buffers once, but subsequent
> > drop operation will be fast.
>
> I'm examining this, but I am not sure if I got the correct understanding. Please correct
> me if I'm wrong.
> I think above is a suggestion wherein the postgres startup process uses local hash table
> to keep track of the buffers of relations. Since there may be other read-only sessions which
> read from disk, evict cached blocks, and modify the shared_buffers, the flag would be updated.
> We could do it during recovery, then release it as recovery completes.
>
> I haven't looked deeply yet into the source code but we maybe can modify the REDO
> (main redo do-while loop) in StartupXLOG() once the read-only connections are consistent.
> It would also be beneficial to construct this local hash when DropRefFileNodeBuffers()
> is called for the first time, so the whole share buffers is scanned initially, then as
> you mentioned subsequent dropping will be fast. (similar behavior to what the patch does)
>
> Do you think this is feasible to be implemented? Or should we explore another approach?
>

I think we should try what Andres is suggesting as that seems like a
promising idea and can address most of the common problems in this
area but if you feel otherwise, then do let us know.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2020-08-07 03:46:19 Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
Previous Message Justin Pryzby 2020-08-07 03:33:37 Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)