Re: The Free Space Map: Problems and Opportunities

From: Hannu Krosing <hannuk(at)google(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Jan Wieck <jan(at)wi3ck(dot)info>, Gregory Smith <gregsmithpgsql(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: The Free Space Map: Problems and Opportunities
Date: 2021-09-06 23:33:28
Message-ID: CAMT0RQQQzWsCbLWBwmUFEdnm7GuowZtThWngZBvDFUaeZdzxfw@mail.gmail.com
Lists: pgsql-hackers

When I have been thinking about this type of problem, it has seemed to
me that the latest -- and correct :) -- place to do all kinds of
cleanup, like removing aborted tuples, freezing committed tuples, and
setting any needed hint bits, would be the background writer or CHECKPOINT.

This would be more PostgreSQL-like, as it moves work that is not
immediately needed off the critical path -- an extension of how MVCC
in PostgreSQL works in general.
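
To make this concrete, here is a standalone sketch of what the
write-out path could do (all types and helpers below are stand-ins,
not the real buffer manager code):

    #include <stdbool.h>
    #include <stdio.h>

    /* Stand-in for a dirty heap page tracked by the buffer manager. */
    typedef struct Page
    {
        int  blkno;
        bool has_aborted_tuples;
        bool has_unfrozen_tuples;
        bool needs_hint_bits;
    } Page;

    /* Stand-ins for the real pruning/freezing/hint-bit routines. */
    static void prune_aborted_tuples(Page *p)    { p->has_aborted_tuples = false; }
    static void freeze_committed_tuples(Page *p) { p->has_unfrozen_tuples = false; }
    static void set_hint_bits(Page *p)           { p->needs_hint_bits = false; }
    static void write_page(Page *p)              { printf("block %d written clean\n", p->blkno); }

    /*
     * What the bgwriter/checkpointer would do for each dirty heap page:
     * all the deferred cleanup happens just before the I/O, so the page
     * reaches disk in a pristine state and no foreground backend ever
     * pays for this work on the critical path.
     */
    static void flush_heap_page(Page *p)
    {
        if (p->has_aborted_tuples)
            prune_aborted_tuples(p);
        if (p->has_unfrozen_tuples)
            freeze_committed_tuples(p);
        if (p->needs_hint_bits)
            set_hint_bits(p);
        write_page(p);
    }

    int main(void)
    {
        Page p = { 42, true, true, true };
        flush_heap_page(&p);
        return 0;
    }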

Another option could be one more background cleaner with a "needs
cleanup" bitmap, updated by the "real work" backends to let the
cleaner know what needs to be done.
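
At its simplest, the bitmap could look roughly like this (a
self-contained illustration, not actual shared memory code):

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NBLOCKS 256
    #define NWORDS  (NBLOCKS / 64)

    /* One bit per heap block; would live in shared memory for real. */
    static _Atomic uint64_t needs_cleanup[NWORDS];

    /* "Real work" backend side: a cheap, lock-free note that this
     * block will need cleanup later. */
    static void mark_block(int blkno)
    {
        atomic_fetch_or(&needs_cleanup[blkno / 64],
                        UINT64_C(1) << (blkno % 64));
    }

    /* Cleaner side: atomically claim one word of bits at a time,
     * then process every block whose bit was set. */
    static void cleaner_pass(void)
    {
        for (int w = 0; w < NWORDS; w++)
        {
            uint64_t bits = atomic_exchange(&needs_cleanup[w], 0);

            while (bits != 0)
            {
                int bit = __builtin_ctzll(bits);   /* gcc/clang builtin */
                bits &= bits - 1;                  /* clear lowest set bit */
                printf("cleaning block %d\n", w * 64 + bit);
            }
        }
    }

    int main(void)
    {
        mark_block(3);
        mark_block(130);
        cleaner_pass();
        return 0;
    }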

But doing it as part of checkpoint probably ends up with fewer WAL
writes in the end.

That said, CHECKPOINT can efficiently clean up only heap pages, unless
we do some extra work to ensure buffer cache eviction ordering, so that
the heap is (almost) always cleaned last and thus has the option of
also cleaning the index pointers which point to it.

There could also be a possibility of doing a small amount of cleanup --
enough for TPC-C-like workloads, but not larger ones -- while waiting
for the next command to arrive from the client over the network. This
of course assumes that we will not improve our feeder mechanism to send
back-to-back commands, which can already be done today, but which I
have seldom seen used.
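
A minimal sketch of that idle loop, assuming poll() on the client
socket with a zero timeout and a hypothetical cleanup_one_page()
helper:

    #include <poll.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical: prune/hint one page, return false when no work is left. */
    static bool cleanup_one_page(void)
    {
        static int pending = 3;   /* pretend three pages need cleanup */

        if (pending <= 0)
            return false;
        pending--;
        printf("cleaned one page\n");
        return true;
    }

    /* Between commands: as long as nothing has arrived from the client
     * (poll with a zero timeout), do one small unit of cleanup at a
     * time, so the cleanup can never delay the next command. */
    static void idle_cleanup(int client_fd)
    {
        struct pollfd pfd = { .fd = client_fd, .events = POLLIN };

        while (poll(&pfd, 1, 0) == 0 && cleanup_one_page())
            ;
    }

    int main(void)
    {
        idle_cleanup(0);   /* fd 0 just for illustration */
        return 0;
    }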

Cheers,
-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.

On Mon, Sep 6, 2021 at 11:59 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Mon, Aug 16, 2021 at 5:15 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > This is actually quite simple -- the chaos that follows is where it
> > gets truly complicated (and not necessarily worth getting into just
> > yet). The update of a given order and its order lines entries takes
> > place hours later, within a delivery transaction.
>
> Update on this: aborted transactions play an important role in the
> chaos that we see with BenchmarkSQL/TPC-C.
>
> This is surprising, in one way. Only 1% of all transactions are
> aborted (per the TPC-C spec). But in another way it's not surprising.
> It's totally consistent with what I've been saying about open and
> closed pages. When a page that's initially allocated and opened by a
> backend becomes closed following some inserts, the page should be left
> in a more or less pristine state -- owning backends that open and
> close pages need to be "good citizens".
>
> The definition of "pristine" must also include "no already-aborted
> tuples" -- aborted xact tuples are a problem that we can and should
> nip in the bud. This is really an extension of what I've been saying
> about not disrupting naturally occurring temporal and spatial locality.
> We're far better off if the space reused by freeing aborted heap
> tuples is reused by the very next transaction running in the very same
> backend. Plus it's not that hard; we can know for sure that this space
> is immediately available, which is not generally true of pruning. And
> so this free space from aborted xact tuples is quite special.
>
> Even with the master branch the problems with the FSM are much less
> noticeable once you configure BenchmarkSQL to not abort any
> transactions [1]. You need to consider the whole picture to understand
> how the issue of aborted transactions has such an outsized negative
> impact on performance and space efficiency.
>
> It's a whole *chain* of events:
>
> * If just 1% of all transactions abort, that's enough to leave an
> average of about 1 aborted tuple on every "order" table page (which
> naturally has small heap tuples [2]), and about 10 on a minority of
> the really big "new order" table's pages. Maybe 1 in 10 or 12 "order
> line" table heap pages are affected in this way.
>
> * The overall impact is that small (but not tiny) pockets of free
> space are left randomly strewn all over the place. Not just LP_DEAD
> line pointer stubs -- whole entire aborted heap tuples.
>
> * In theory we could address this problem using opportunistic pruning.
> But in practice that simply cannot happen today, because we don't
> touch the inserted rows until literally hours later.
>
> The same rows will eventually have updates and selects, but that won't
> help -- too little, too late. That's just how opportunistic pruning
> works: currently you need some kind of scan node for opportunistic
> pruning to kick in, so continually inserting transactions (from new
> orders) are currently fundamentally unable to do any opportunistic
> pruning. Much less opportunistic pruning that kicks in at exactly the
> right time.
>
> * When an autovacuum worker runs against these tables, it will of
> course notice this diffuse free space from aborted tuples, and put it
> in the FSM.
>
> That free space will be reused, but by totally unrelated logical rows.
>
> What we really seem to need here is some rudimentary form of "eager
> physical rollback"; the really important thing seems to be to not
> allow the problem to happen in the first place. I have a basic
> prototype implementation, built on top of the larger prototype patch.
> It opportunistically prunes-away aborted heap tuples on target/open
> pages, bypassing the usual pd_prune_xid check we have in
> heap_page_prune_opt() -- that doesn't make sense with the new form of
> targeted, abort-orientated pruning. The new pruning mechanism makes a
> really big difference on its own, without even involving the FSM. The
> best cure for all kinds of bloat is prevention, which this comes
> pretty close to.
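>
> To give a feel for the shape of it, here is a standalone sketch with
> stand-in types -- not the prototype code itself:
>
>     #include <stdint.h>
>     #include <stdio.h>
>
>     typedef uint32_t TransactionId;
>
>     /* Stand-in for a tuple on one of the backend's open pages. */
>     typedef struct TupleSlot
>     {
>         TransactionId xmin;   /* inserting transaction */
>         int           dead;
>     } TupleSlot;
>
>     /* On abort, eagerly prune the tuples we just inserted from the
>      * pages this backend owns -- no pd_prune_xid gate, no visibility
>      * checks: we know for sure the space is immediately free. */
>     static void
>     prune_aborted_on_page(TupleSlot *tuples, int ntuples,
>                           TransactionId aborted)
>     {
>         for (int i = 0; i < ntuples; i++)
>         {
>             if (tuples[i].xmin == aborted)
>             {
>                 tuples[i].dead = 1;
>                 printf("reclaimed tuple %d\n", i);
>             }
>         }
>     }
>
>     int main(void)
>     {
>         TupleSlot page[] = { {100, 0}, {101, 0}, {100, 0} };
>         prune_aborted_on_page(page, 3, 100);
>         return 0;
>     }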
>
> I think that this general idea can be pushed further. I've talked
> about aborts, but what about the commit path? I don't see why we can't
> make inserting backends responsible for setting hint bits on recently
> inserted rows -- the cost of putting off that work can only go up, no
> matter what (unless maybe the user drops the table). Again, the
> missing piece seems to be a general sense of ownership of heap pages
> by transactions and/or backends. In order for a backend to be able to
> clean up its own mess after itself while it's still cheap, it has to
> understand what that actually means.
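>
> Sketched in the same stand-in style (HEAP_XMIN_COMMITTED is the real
> infomask flag; everything else here is hypothetical):
>
>     #include <stdint.h>
>
>     typedef uint32_t TransactionId;
>
>     #define HEAP_XMIN_COMMITTED 0x0100   /* as in htup_details.h */
>
>     typedef struct TupleHeader
>     {
>         TransactionId xmin;
>         uint16_t      infomask;
>     } TupleHeader;
>
>     /* At commit, revisit the tuples this backend inserted while
>      * their pages are still hot in cache and set the hint bit right
>      * away -- the cost of putting it off can only go up. */
>     static void
>     hint_own_inserts(TupleHeader **mine, int n, TransactionId my_xid)
>     {
>         for (int i = 0; i < n; i++)
>         {
>             if (mine[i]->xmin == my_xid)
>                 mine[i]->infomask |= HEAP_XMIN_COMMITTED;
>         }
>     }
>
>     int main(void)
>     {
>         TupleHeader t = { 200, 0 };
>         TupleHeader *mine[] = { &t };
>         hint_own_inserts(mine, 1, 200);
>         return 0;
>     }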
>
> Don't forget that the TPC-C workload doesn't have that many updates
> (at least not for the tables I'm focussed on here). Nothing that we
> tend to think of as an adversarial case for Postgres, except perhaps
> the updates against the "order" table, which are always non-HOT
> updates due to an issue with the indexes. That's the extent of it,
> though -- every other table easily has 99%+ of all updates use HOT
> (albeit with some heap fill factor tuning). We should *expect* long
> term stability here, based on the particulars of the workload; we only
> need to update every "order" row and every "order lines" row exactly
> once. After that they can just age out of shared_buffers forever
> (actually they might be read again, but never modified again; reads
> were never the problem). The big "order line" table is by far
> the biggest problem, even though it manages to use HOT for practically
> all updates.
>
> [1] https://github.com/wieck/benchmarksql/commit/66b8db073545026dc76ef513d2b0e318d2f3d5a2
> [2] https://apps.dtic.mil/dtic/tr/fulltext/u2/a264793.pdf -- "Table 1:
> Summary of Logical Database"
> --
> Peter Geoghegan
>
>
