Re: Issues Outstanding for Point In Time Recovery (PITR)

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: "J(dot) R(dot) Nield" <jrnield(at)usol(dot)com>
Cc: PostgreSQL Hacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Issues Outstanding for Point In Time Recovery (PITR)
Date: 2002-07-05 17:20:37
Message-ID: 200207051720.g65HKbh22474@candle.pha.pa.us
Lists: pgsql-hackers

J. R. Nield wrote:
> On Fri, 2002-07-05 at 01:42, Bruce Momjian wrote:
> >
> > We have needed
> > point-in-time recovery for a long time,
>
> Most thanks should go to vadim (and whoever else worked on this), since
> his WAL code already does most of the work. The key thing is auditing
> the backend to look for every case where we assume some action is not
> visible until after commit, and therefore don't log its effects. Those
> are the main cases that must be changed.

Yep. Glad you can focus on that.

> > ---------------------------------------------------------------------------
> >
> > J. R. Nield wrote:
> > > Hello:
> > >
> > > I've got the logging system to the point where I can take a shutdown
> > > consistent copy of a system, and play forward through multiple
> > > checkpoints. It seems to handle CREATE TABLE/DROP TABLE/TRUNCATE
>
> But notably not for the btree indexes! It looked like they were working,
> because the files were there, and all indexes created before the backup
> would work under insert/delete (including sys catalog indexes). This is
> because btree insert/delete is logged, just not during build. So I
> missed that one case.
>
> You will end up with up-to-date table data though, so it is something.
>
> Adding logging support to btbuild is the next step, and I don't think it
> should be too hard. I am working on this now.

Great.

> It is also a major advantage that most everything in the system gets
> stored in the catalog tables, and so is logged already.
>
>
> > Uh, we do log pre-page writes to WAL to recover from partial page
> > writes to disk. Is there something more we need here?
> >
> > As for bad block detection, we have thought about adding a CRC to each
> > page header, or at least making it optional. WAL already has a CRC.
> >
>
> Yes, this should be done last, because it is not necessary for PITR,
> only for performance (the option to skip writing pre-images without
> fear of data loss).

Yep.

> > Yes, there are a few places where we actually create a file, and if the
> > server crashes, the file remains out there forever. We need to track that
> > better.
>
> OK, there is a bigger problem than just tracking the file though. We
> sometimes do stuff to that file that we don't log. We assume that if we
> commit, the file must be OK and will not need replay because the
> transaction would not have committed if the file was not in a committable
> state. If we abort, the system never sees the file, so in a sense we
> undo everything we did to the file. It is a kind of poor-man's rollback
> for certain operations, like btbuild, create table, etc. But it means
> that we cannot recover the file from the log, even after a commit.

Yep.

> >
> > >
> > > §1.1.1 - CREATE DATABASE is also unlogged
> > >
> > > This will cause the same replay problems as above.
> >
> > Yep. Again, seems a master cleanup on startup is needed.
>
> The cleanup is not the problem, only a nuisance. Creating the files
> during replay is the problem. I must recreate CREATE DATABASE from the
> log exactly as it was done originally. I think just logging the
> parameters to the command function should be sufficient, but I need to
> think more about it.

OK, makes sense. Nice when you can bundle a complex action into the
logging of one command and its parameters.
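
Just to make sure we mean the same thing, here is roughly what I picture
such a record looking like (a sketch only; none of these names exist in
the code today):

    /*
     * Hypothetical redo record for CREATE DATABASE: it carries only the
     * command's parameters, so replay can repeat the directory copy
     * itself rather than trying to reconstruct every copied block.
     */
    typedef struct xl_dbase_create
    {
        Oid     db_id;                  /* OID of the database being created */
        Oid     src_db_id;              /* OID of the template being copied */
        char    src_path[MAXPGPATH];    /* template database directory */
        char    dst_path[MAXPGPATH];    /* new database directory */
    } xl_dbase_create;

    static void
    dbase_create_redo(xl_dbase_create *xlrec)
    {
        /*
         * Re-run what the original command did: create dst_path and
         * recursively copy src_path into it, much as createdb() does.
         * Blocks changed after the copy are covered by later WAL
         * records, so the copy itself never needs block-level logging.
         */
    }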

> >
> > > b) If TRUNCATE TABLE fails, the system must PANIC. Otherwise, the table
> > > may be used in a future command, and a replay-recovered database may
> > > end up with different data than the original.
> >
> > We number based on oids. You mean oid wraparound could cause the file
> > to be used again?
>
> That's not what I meant. Let's say I issue 'TRUNCATE TABLE foo'. Then,
> right before smgrtruncate is called, I do an XLogInsert saying "Redo a
> TRUNCATE TABLE on foo to nblocks if we crash". Then smgrtruncate fails
> and we do an elog(ERROR).
>
> Now the user decides that since TRUNCATE TABLE didn't work, he might as
> well use the table, so he inserts some records into it, generating log
> entries.
>
> When I replay this log sequence later, what happens if the TRUNCATE
> succeeds instead of failing?

You mean the user is now accessing a partially truncated table? That's
just too weird. I don't see how the WAL would know how far truncation
had gone. I see why you would need the panic and it seems acceptable.

> I admit that there are other ways of handling it than to PANIC if the
> truncate fails. All the ones I can come up with seem to amount to some
> kind of ad-hoc UNDO log.

Yea, truncate failure seems so rare as to be nearly impossible, so we can
do a panic and see if it ever happens to anyone. I bet it will not. Those
failures usually come from an OS crash anyway, so the effect is the same
as a panic.
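
To spell out the ordering you are describing, something like this (a
sketch only; log_truncate_record is a made-up helper, and the real code
paths differ):

    static void
    truncate_with_logging(Relation rel, int nblocks)
    {
        XLogRecPtr  recptr;

        /* First, write and flush the "redo a TRUNCATE of rel to nblocks"
         * record, so replay is guaranteed to see it. */
        recptr = log_truncate_record(rel, nblocks);     /* made-up helper */
        XLogFlush(recptr);

        /*
         * Then do the physical truncate.  If it fails, we cannot just
         * elog(ERROR) and let the user keep using the table: on replay
         * the truncate might succeed, and the recovered database would
         * then differ from the original.  Hence the PANIC.
         */
        if (smgrtruncate(DEFAULT_SMGR, rel, nblocks) < 0)
            elog(PANIC, "TRUNCATE failed after its WAL record was flushed");
    }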

> > > WAL must be flushed before truncate as well.
> > >
> > > WAL does not need to be flushed before create, if we don't mind
> > > leaking files sometimes.
> >
> > Cleanup?
>
> Yes, we could garbage-collect leaked files. XLogFlush is not that
> expensive though, so I don't have an opinion on this yet.

Right now, if we do CREATE TABLE, and the backend crashes, I think it
leaves a nonreferenced file around. Not something you need to worry
about for replication, I guess. We can address it later.

> > > c) Redo code should treat writes to non-existent files as an error.
> > > Changes affect heap & nbtree AM's. [Check others]
> >
> > Yep, once you log create/drop, if something doesn't match, it is an
> > error, while before, we could ignore it.
> >
> > > d) rtree [and GiST? WTF is GiST? ] is not logged. A replay recovery of
> > > a database should mark all the rtree indices as corrupt.
> > > [ actually we should do that now, are we? ]
> >
> > Known problem. Not sure what is being done. TODO has:
> >
> > * Add WAL index reliability improvement to non-btree indexes
> >
> > so it is a known problem, and we aren't doing anything about it. What
> > more can I say? ;-)
>
> Once the other stuff works reliably, I will turn to rtree logging, which
> I have looked at somewhat, although I could really use a copy of the
> paper it is supposed to be based on. I have not figured out GiST enough
> to work on it yet.

There has been talk of retiring rtree and using the GiST version of
rtree. I recall it had some advantages and disadvantages, but I don't
remember the details.

> > > Database will do infinite shutdown consistent system recovery from the
> > > online logs, if you manually munge the control file to set state ==
> > > DB_IN_PRODUCTION instead of DB_SHUTDOWNED.
> >
> > Wow, how did you get so far?
>
> Because it was almost there to start with :-)
>
> Besides, it sounded better before I realized there was still a remaining
> problem with btree logging to fix.

Our major problem is that we have very few people who like to work
at this level in the code. Glad you are around.

> > > In the case of LSNLast, we check to see if pd_lsn == the lsn in the
> > > last 64 bits of the page. If not, we assume the page is corrupt from
> > > a partial write (although it could be something else).
> >
> > LSN?
>
> Log Sequence Number (XLogRecPtr)

Yep, I remembered later.

> > > IMPORTANT ASSUMPTION:
> > > The OS/disk device will never write both the first part and
> > > last part of a block without writing the middle as well.
> > > This might be wrong in some cases, but at least it's fast.
> > >
> > > §2.2.4 - GUC Variables
> > >
> > > The user should be able to configure what method is used:
> > >
> > > block_checking_write_method = [ checksum | torn_page_flag | none ]
> > >
> > > Which method should be used for blocks we write?
> >
> > Do we want torn page flag? Seems like a pain to get that on every 512
> > byte section of the 8k page.
>
> Ok, this section (2.2) was badly written and hard to understand. What I
> am proposing is that we put a copy of the log sequence number, which is
> at the head of the page, into the 8-byte field that we are creating at
> the end of the page, in place of the CRC. The log sequence number
> increases every time the page is written (it already does this). I have
> called this method 'LSNLast' internally, and the user would call it the
> 'torn_page_flag' method.
>
> So when we read the page, we compare the Log Sequence Number at the
> beginning and end of the page, and if they are different we assume a
> torn page.

Excellent.

> This version is weaker than the MS one we were talking about, because it
> is not on every 512-byte section of the page, only the beginning and the
> end. I'm simply looking for a fast alternative to CRC64 that doesn't
> require massive reorganization of the page layout code.

Great idea, and cheap.
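
If I follow, the check on read would then be something like this (sketch
only; the trailing copy of the LSN does not exist yet, so that part of
the layout is made up):

    static bool
    page_is_torn(Page page)
    {
        PageHeader  hdr = (PageHeader) page;
        XLogRecPtr *tail_lsn;

        /* The proposed 8-byte copy of the LSN sits in the last bytes of
         * the page, where the CRC would otherwise go. */
        tail_lsn = (XLogRecPtr *) ((char *) page + BLCKSZ - sizeof(XLogRecPtr));

        /*
         * If the header LSN and the trailing copy disagree, only part of
         * the last write made it to disk, so assume a torn page.
         */
        return !XLByteEQ(hdr->pd_lsn, *tail_lsn);
    }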

> > > The system needs to be able to prompt the system administrator to feed
> > > it more log files.
> > >
> > > TODO: More here
> >
> > Yes, we can have someone working on the GUI once the command-line
> > interface is defined.
>
> Yes, and the system must not allow any concurrent activity during
> recovery either. So it looks like a standalone backend operation.

OK, we can address this capability.

> > That is tricky. We have discussed it and your backup has to deal with
> > some pretty strange things that can happen while 'tar' is traversing the
> > directory.

OK, first, are you thinking of having the nightly backup operate at the
file system level, or accessing the pages through the PostgreSQL shared
buffers?

> Even if we shut down before we copy the file, we don't want a file that

Oh, so you are thinking of some kind of tar while the db is shut down,
and using that as the backup?

> hasn't been written to for five weeks before it was backed up to require
> five weeks of old log files to recover. So we need to track that
> information somehow, because right now if we scanned the blocks in the
> file looking at the page LSNs, the greatest LSN we would see might
> be much older than where it would be safe to recover from. That is the
> biggest problem, I think.

You are saying, "How do we know what WAL records go with that backup
snapshot of the file?" OK, lets assume we are shutdown. You can grab
the WAL log info from pg_control using contrib/pg_controldata and that
tells you what WAL logs to roll forward when you need to PIT recover
that backup later. If you store that info in the first file you backup,
you can have that WAL pointer available for later recovery in case you
are restoring from that backup. Is that the issue?

What seems more complicated is doing the backup while the database is
active, and this may be a requirement for a final PITR solution. Some
think we can grab the WAL pointer at 'tar' start and replay that on the
backup even if the file changes during backup.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
