Re: POC: Cleaning up orphaned files using undo logs

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-08-22 13:46:59
Message-ID: CA+Tgmoa15jwH6jbywFd2m_V9n44d2ckLuYnVSbPZ+0EJgz8Q8Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Aug 22, 2019 at 12:54 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> But why? It makes a *lot* more sense to have it in the beginning. I
> don't think bulk-fetch really requires it to be in the end - we can
> still process records forward on a page-by-page basis.

There are two separate needs here: to be able to go forward, and to be
able to go backward. We have the length at the end of each record not
because we're stupid, but so that we can back up. If we have another
way of backing up, then the thing to do is not to move that to
beginning of the record but to remove it entirely as unnecessary
wastage. We can also think about how to improve forward traversal.

Considering each problem separately:

For forward traversal, we could simplify things somewhat by having
only 3 decoding stages instead of N decoding stages. We really only
need (1) a stage for accumulating bytes until we have uur_info, and
then (2) a stage for accumulating bytes until we know the payload and
tuple lengths, and then (3) a stage for accumulating bytes until we
have the whole record. We have a lot more stages than that right now
but I don't think we really need them for anything. Originally we had
them so that we could do incremental decoding to find the transaction
header in the record, but now that the transaction header is at a
fixed offset, I think the multiplicity of stages is just baggage.

We could simplify things more by deciding that the first two bytes of
the record are going to contain the record size. That would increase
the size of the record by 2 bytes, but we could (mostly) claw those
bytes back by not storing the size of both uur_payload and uur_tuple.
The size of the other one would be computed by subtraction: take the
total record size, subtract the size of whichever of those two things
we store, subtract the mandatory and optional headers that are
present, and the rest must be the other value. That would still add 2
bytes for records that contain neither a payload nor a tuple, but that
would probably be OK given that (a) a lot of records wouldn't be
affected, (b) the code would get simpler, and (c) something like this
seems necessary anyway given that we want to make the record format
more generic. With this approach instead of 3 stages we only need 2:
(1) accumulating bytes until we have the 2-byte length word, and (2)
accumulating bytes until we have the whole record.

For backward traversal, as I see it, there are basically two options.
One is to do what we're doing right now, and store the record length
at the end of the record. (That might mean that a record both begins
and ends with its own length, which is not a crazy design.) The other
is to do what I think you are proposing here: locate the beginning of
the first record on the page, presumably based on some information
stored in the page header, and then work forward through the page to
figure out where all the records start. Then process them in reverse
order. That saves 2 bytes per record. It's a little more expensive in
terms of CPU cycles, especially if you only need some of the records
on the page but not all of them, but that's probably not too bad.

I think I'm basically agreeing with what you are proposing but I think
it's important to spell out the underlying concerns, because otherwise
I'm afraid we might think we have a meeting of the minds when we don't
really.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2019-08-22 13:55:03 Re: mingw32 floating point diff
Previous Message Paul Guo 2019-08-22 13:13:20 Re: standby recovery fails (tablespace related) (tentative patch and discussion)