Re: Why mark empty pages all visible?

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Why mark empty pages all visible?
Date: 2023-03-28 22:29:50
Message-ID: 20230328222950.v7jzwfmrmpq4ntrp@awork3.anarazel.de

Hi,

On 2023-03-28 13:05:19 -0700, Peter Geoghegan wrote:
> On Tue, Mar 28, 2023 at 12:01 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I will probably make that argument - so far I was just trying to understand
> > the intent of the current code. There aren't really comments explaining why we
> > want to mark currently-pinned empty/new pages all-visible and enter them into
> > the FSM.
>
> I don't think that not being able to immediately get a cleanup lock on
> a page signifies much of anything. I never have.

Why is that? It's as clear a signal of concurrent activity on the buffer as
you're going to get.

> > Historically we did *not* enter currently pinned empty/new pages into the FSM
> > / VM. Afaics that's new as of 44fa84881fff.
>
> Of course that's true, but I don't know why that history is
> particularly important.

It's interesting to understand *why* we are doing what we are. I think it'd
make sense to propose changing how things work around this, but I just don't
feel like I have a good enough grasp of why we do some of the things we
do. And given that there aren't a lot of comments around it, and that some of
the comments that do exist are inconsistent with each other, looking at the
history seems like the next best thing?

> I actually think that I might agree with the substance of much of what
> you're saying, but at the same time I don't think that you're defining
> the problem in a way that's particularly helpful.

Likely because the goals of the existing code aren't clear to me. So I don't
feel like I have a firm grasp...

> I gather that you *don't* want to do anything about the
> lazy_vacuum_heap_page behavior with setting empty pages all-visible (just
> the lazy_scan_new_or_empty behavior).

Not in the short to medium term, at least. In the long term I do suspect we
might want to do something about it. In bulk workloads we have a *crapton* of
contention, both within the FSM itself and caused by the FSM. With my relation
extension patch, disabling the FSM nearly doubles concurrent load speed.

At the same time, the fact that we might lose knowledge of all the existing
free space in case of a promotion or crash, and never rediscover that space
(because the pages are frozen), seems decidedly not great.

I don't know what the path forward is, but it seems somewhat clear that we
ought to do something. I suspect that an FSM that isn't crash-safe isn't
really acceptable anymore - it's probably fine not to persist a *reduction* in
free space, but we can't permanently lose track of free space, no matter how
many crashes.

I know that you strongly dislike the way the FSM works, although I've
forgotten some of the details.

One thing I am fairly certain about is that using the FSM to tell other
backends about newly bulk extended pages is not a good solution, even though
we're stuck with it for the moment.

> I actually agree that VACUUM is way too unconcerned about interfering
> with concurrent activity in terms of how it manages free space in the
> FSM. But this seems like just about the least important example of
> that (outside the context of your RelationGetBufferForTuple work). The
> really important cases (that VACUUM gets wrong) all involve recently
> dead tuples. But I don't think that you want to talk about that right
> now. You want to talk about RelationGetBufferForTuple.

That's indeed the background. By now I'd also like to add a few comments
explaining why we do what we currently do, because I don't find all of it
obvious.

> > The reason I'm looking at this is that there's a lot of complexity at the
> > bottom of RelationGetBufferForTuple(), related to needing to release the lock
> > on the newly extended page and then needing to recheck whether there still is
> > free space on the page. And that it's not complicated enough
> > (c.f. INSERT_FROZEN doing visibilitymap_pin() with the page locked).
> >
> > As far as I can tell, if we went back to not entering new/empty pages into
> > either VM or FSM, we could rely on the page not getting filled while just
> > pinning, not locking it.
>
> What you're essentially arguing for is inventing a new rule that makes
> the early lifetime of a page (what we currently call a PageIsEmpty()
> page, and new pages) special, to avoid interference from VACUUM. I
> have made similar arguments myself on quite a few occasions, so I'm
> actually sympathetic. I just think that you should own it. And no, I'm
> not just reflexively defending my work in 44fa84881fff; I actually
> think that framing the problem as a case of restoring a previous
> behavior is confusing and ahistorical. If there was a useful behavior
> that was lost, then it was quite an accidental behavior all along. The
> difference matters because now you have to reconcile what you're
> saying with the lazy_vacuum_heap_page no-cleanup-lock behavior added
> in 14.

I really don't have a position to own yet, not on firm enough ground.

> I think that you must be arguing for making the early lifetime of a
> heap page special to VACUUM, since AFAICT you want to change VACUUM's
> behavior with strictly PageIsEmpty/PageIsNew pages only -- and *not*
> with pages that have one remaining LP_UNUSED item, but are otherwise
> empty (which could be set all-visible/all-frozen in either the first
> or second heap pass, even if we disabled the lazy_scan_new_or_empty()
> behavior you're complaining about). You seem to want to distinguish
> between very new pages (that also happen to be empty), and old pages
> that happen to be empty. Right?

I think that might be worthwhile, yes. The retry code in
RelationGetBufferForTuple() is quite hairy and almost impossible to test. If
we can avoid that complexity, at a fairly bounded cost (vacuum needing to
re-visit new/empty pages if they're currently pinned), it'd imo be more than
worth the price.
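
To make "hairy" a bit more concrete, the shape of the problem is roughly the
following (a simplified sketch, not the actual hio.c code; error handling is
elided, and extend_relation_locked() is a made-up placeholder for "give me a
newly extended page, exclusively locked"):

#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/* placeholder, does not exist; stands in for the extension path */
extern Buffer extend_relation_locked(Relation relation);

static Buffer
get_page_for_tuple_sketch(Relation relation, Size len)
{
	for (;;)
	{
		Buffer		buf = extend_relation_locked(relation);
		Page		page = BufferGetPage(buf);

		/*
		 * Several steps (e.g. pinning the VM page for INSERT_FROZEN, lock
		 * ordering with an "other" buffer) can force us to drop the content
		 * lock temporarily.
		 */
		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
		/* ... visibilitymap_pin(), lock ordering dance, ... */
		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

		/*
		 * Because the new page may already have been advertised (e.g. via
		 * the FSM), another backend could have filled it while we didn't
		 * hold the lock, so we have to recheck and possibly start over.
		 */
		if (PageGetHeapFreeSpace(page) >= len)
			return buf;

		UnlockReleaseBuffer(buf);
	}
}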

But perhaps the better path forward is to just bite the bullet and introduce
a shared-memory table of open files that contains a "content size" and a
"physical size" for each relation. We've had a lot of things over the years
that'd have benefited from that.

To address the RelationGetBufferForTuple() case, vacuum would simply not scan
beyond the "content size", so it'd never encounter the page that
RelationGetBufferForTuple() is currently dealing with. And we'd not need to
add bulk extended pages into the FSM.
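
Very roughly, and purely to illustrate the idea (nothing like this exists
today, all names made up):

#include "postgres.h"

#include "storage/block.h"
#include "storage/relfilelocator.h"
#include "storage/spin.h"

/*
 * Hypothetical shared-memory hash table entry, keyed by relfilelocator.
 * "physical_size" is how far the file has been extended on disk,
 * "content_size" is how far backends have actually put data.  Vacuum would
 * stop at content_size, so it'd never encounter a page another backend has
 * just extended but not yet filled, and bulk-extended pages wouldn't have to
 * be handed out via the FSM.
 */
typedef struct SharedRelSizeEntry
{
	RelFileLocator rlocator;	/* hash key */
	slock_t		mutex;			/* protects the two sizes */
	BlockNumber content_size;	/* blocks that may contain data */
	BlockNumber physical_size;	/* blocks that exist on disk */
} SharedRelSizeEntry;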

This would also, I think, be the basis for teaching vacuum to truncate
relations without acquiring an AEL - which IME causes a lot of operational
issues.

It'd not do anything about losing track of free space, though :/.

> > I also do wonder whether the different behaviour of PageIsEmpty() and
> > PageIsNew() pages makes sense. The justification for not marking PageIsNew()
> > pages as all-visible holds just as true for empty pages. There's just as much
> > free space there.
>
> What you say here makes a lot of sense to me. I'm just not sure what
> I'd prefer to do about it.

You and me both...

I wonder what it'd take to make the FSM "more crash-safe". Leaving aside the
cases of "just extended" new/empty pages: we already call
XLogRecordPageWithFreeSpace() for XLOG_HEAP2_VACUUM, XLOG_HEAP2_VISIBLE and
XLOG_HEAP2_PRUNE, as well as for insert/multi_insert/update. However, we don't
set the LSN of FSM pages, which means we'll potentially flush dirty FSM
buffers to disk before the corresponding WAL records make it to disk.

ISTM that even if we just used the LSN of the last WAL record for
RecordPageWithFreeSpace()/XLogRecordPageWithFreeSpace(), and perhaps also
called XLogRecordPageWithFreeSpace() during XLOG_HEAP2_FREEZE_PAGE replay,
we'd *drastically* shrink the chance of losing track of free space. Obviously
not free, but ISTM that it can't add a lot of overhead.
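
As a rough sketch of what I mean (not a patch; this is approximately the spot
in freespace.c where a slot gets updated today, and the lsn would be whatever
WAL record justified the update, e.g. the heap record's end LSN):

#include "postgres.h"

#include "access/xlogdefs.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/fsm_internals.h"

static void
fsm_set_avail_with_lsn(Buffer buf, int slot, uint8 newValue, XLogRecPtr lsn)
{
	Page		page = BufferGetPage(buf);

	/* caller holds an exclusive lock on the FSM buffer */
	if (fsm_set_avail(page, slot, newValue))
	{
		/*
		 * Stamping the page LSN makes the buffer manager flush WAL up to
		 * that point before writing the FSM page out, so a dirty FSM buffer
		 * can no longer reach disk ahead of the corresponding heap record.
		 */
		if (!XLogRecPtrIsInvalid(lsn))
			PageSetLSN(page, lsn);
		MarkBufferDirty(buf);
	}
}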

I think we can lose the contents of individual leaf FSM pages, e.g. due to
checksum failures - but perhaps we could address that on access, e.g. by
removing the frozen bit for the corresponding heap pages, which'd lead us to
eventually rediscover the free space?

That'd still leave us with upper-level corruption, but I guess we could just
recompute those levels in some circumstances?

Hm - it'd sure be nice if pg_buffercache would show the LSN of valid pages...

Greetings,

Andres Freund
