From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: POC: Cleaning up orphaned files using undo logs |
Date: | 2019-08-09 16:13:15 |
Message-ID: | CA+Tgmoavrc6iHRrxqyoe-YSq6OzmGswyvKOWxZpn=ULtSUPyyQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Aug 7, 2019 at 6:57 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> Yeah, that's also a problem with complicated WAL record types. Hopefully
> the complex cases are an exception, not the norm. A complex case is
> unlikely to fit any pre-defined set of fields anyway. (We could look at
> how e.g. protobuf works, if this is really a big problem. I'm not
> suggesting that we add a dependency just for this, but there might be
> some patterns or interfaces that we could mimic.)
I think what you're calling the complex cases are going to be pretty
normal cases, not something exotic, but I do agree with you that
making the infrastructure more generic is worth considering. One idea
I had is to use the facilities from pqformat.h; have the generic code
read whatever the common fields are, and then pass the StringInfo to
the AM which can do whatever it wants with the rest of the record, but
probably these facilities would make it pretty easy to handle either a
series of fixed-length fields or alternatively variable-length data.
What do you think of that idea?
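
A minimal sketch of that split, not taken from any patch: the generic code reads the common fields and then hands the same read cursor to the AM. Everything here is hypothetical. `UndoBuf` is a small stand-in for a `StringInfo` being read, `buf_get_u32` and `buf_get_bytes` only imitate the style of the pqformat.h readers, and the record layout (an XID as the sole common field, then an AM payload of a block number plus length-prefixed data) is assumed purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a StringInfo used as a read cursor. */
typedef struct UndoBuf
{
    const uint8_t *data;
    size_t      len;
    size_t      cursor;
} UndoBuf;

/* Read a 4-byte big-endian integer and advance the cursor. */
static uint32_t
buf_get_u32(UndoBuf *buf)
{
    const uint8_t *p = buf->data + buf->cursor;

    assert(buf->cursor + 4 <= buf->len);
    buf->cursor += 4;
    return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
           ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* Return a pointer to the next n raw bytes and advance the cursor. */
static const uint8_t *
buf_get_bytes(UndoBuf *buf, size_t n)
{
    const uint8_t *p = buf->data + buf->cursor;

    assert(buf->cursor + n <= buf->len);
    buf->cursor += n;
    return p;
}

/* Fields the generic undo code understands (assumed: just the XID). */
typedef struct UndoCommon
{
    uint32_t    xid;
} UndoCommon;

/* AM callback: consumes whatever follows the common fields. */
typedef void (*am_undo_decode_fn) (const UndoCommon *common,
                                   UndoBuf *rest, void *am_out);

/* Generic code: read the common part, then delegate the rest to the AM. */
static void
decode_undo_record(UndoBuf *buf, am_undo_decode_fn am_decode,
                   UndoCommon *common, void *am_out)
{
    common->xid = buf_get_u32(buf);
    am_decode(common, buf, am_out);
}

/* Example AM payload: one fixed-length field, then variable-length data. */
typedef struct DemoAmUndo
{
    uint32_t    block;
    uint32_t    payload_len;
    const uint8_t *payload;     /* points into the record buffer */
} DemoAmUndo;

static void
demo_am_decode(const UndoCommon *common, UndoBuf *rest, void *am_out)
{
    DemoAmUndo *out = am_out;

    (void) common;              /* this AM does not need the common fields */
    out->block = buf_get_u32(rest);
    out->payload_len = buf_get_u32(rest);
    out->payload = buf_get_bytes(rest, out->payload_len);
}
```

The generic layer never needs to know the AM's layout; a different AM could read several TIDs, or a TID wider than 6 bytes, from the same cursor.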
(That would not preclude doing compression on top, although I think
that feeding everything through pglz or even lz4/snappy may eat more
CPU cycles than we can really afford. The option is there, though.)
> If you remember, we did a big WAL format refactoring in 9.5, which moved
> some information from AM-specific structs to the common headers. Namely,
> the information on the relation blocks that the WAL record applies to.
> That was a very handy refactoring, and allowed tools like pg_waldump to
> print more detailed information about all WAL record types. For WAL
> records, moving the block information was natural, because there was
> special handling for full-page images anyway. However, I don't think we
> have enough experience with UNDO log yet, to know which fields would be
> best to include in the common undo header, and which to leave as
> AM-specific payload. I think we should keep the common header slim, and
> delegate to the AM routines.
Yeah, I remember. I'm not really sure I totally buy your argument that
we don't know what besides XID should go into an undo record: tuples
are a pretty important concept, and although there might be some
exceptions here and there, I have a hard time imagining that undo is
going to be primarily about anything other than identifying a tuple
and recording something you did to it. On the other hand, you might
want to identify several tuples, or identify a tuple with a TID that's
not 6 bytes, so that's a good reason for allowing more flexibility.
Another point in favor of being more flexible is that it's not
clear that there's any use case for third-party tools that work using
undo. WAL drives replication and logical decoding and could be used
to drive incremental backup, but it's not really clear that similar
applications exist for undo. If it's just private to the AM, the AM
might as well be responsible for it. If that leads to code
duplication, we can create a library of common routines and AM users
can use them if they want.
> Hmm. If you're following an UNDO chain, from newest to oldest, I would
> assume that the newer record has enough information to decide whether
> you need to look at the previous record. If the previous record is no
> longer interesting, it might already be discarded away, after all.
I actually thought zedstore might need this pattern. If you store an
XID with each undo pointer, as the current zheap code mostly does,
then you have enough information to decide whether you care about the
previous undo record before you fetch it. But if a tuple stores only an
undo pointer, and you determine that the undo isn't discarded, you
have to fetch the record first and then possibly decide that you had
the right version in the first place. Now, maybe that pattern doesn't
repeat, because the undo records could be set up to contain both an
XMIN and an XMAX, but not necessarily. I don't know exactly what you
have in mind, but it doesn't seem totally crazy that an undo record
might contain the XID that created that version but not the XID that
created the prior version, and if so, you'll iterate backwards until
you either hit the end of undo or go one undo record past the version
you can see.
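
The traversal described above can be sketched as follows. This is an illustration under loud assumptions, not zheap or zedstore code: the `UndoRec` layout, the integer stand-ins for tuple versions, the array index used as an undo pointer, and the visibility rule (an XID is visible if it is less than `snapshot_xid`) are all hypothetical simplifications. Each record is assumed to hold the XID that created the version newer than the image it stores, and reaching the end of undo is taken to mean the remaining candidate is visible to everyone.

```c
#include <stdint.h>

#define INVALID_UNDO (-1)

/* Hypothetical undo record for an update. */
typedef struct UndoRec
{
    uint32_t    creator_xid;    /* XID that replaced prior_version */
    int         prior_version;  /* stand-in for the old tuple image */
    int         prev;           /* older record in the chain, or INVALID_UNDO */
} UndoRec;

/*
 * Walk the chain from newest to oldest.  The candidate starts as the
 * version on the page.  Because the tuple carries only an undo pointer,
 * the XID that created the candidate is known only after fetching the
 * next record, so the walk always reads one record past the visible
 * version (or runs off the end of undo).
 */
static int
find_visible_version(const UndoRec *log, int newest, int page_version,
                     uint32_t snapshot_xid, int *records_fetched)
{
    int         candidate = page_version;
    int         ptr = newest;

    *records_fetched = 0;
    while (ptr != INVALID_UNDO)
    {
        const UndoRec *rec = &log[ptr];

        (*records_fetched)++;
        if (rec->creator_xid < snapshot_xid)
            return candidate;   /* the candidate was right all along */

        candidate = rec->prior_version;
        ptr = rec->prev;
    }
    return candidate;           /* hit the end of undo */
}
```

If the tuple also stored the XID alongside its undo pointer, the first visibility test could run before any fetch, which is the difference from the zheap pattern mentioned earlier.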
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company