From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: "J(dot) R(dot) Nield" <jrnield(at)usol(dot)com>
Cc: PostgreSQL Hacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Issues Outstanding for Point In Time Recovery (PITR)
Date: 2002-07-05 05:42:28
Message-ID: 200207050542.g655gSx08615@candle.pha.pa.us
Lists: pgsql-hackers
I noticed no one has responded to your questions yet.
I think it is because we are sort of in shock. We have needed
point-in-time recovery for a long time, but the people who were capable
of doing it weren't able to focus on it. Then, all of a sudden, we get
an email from someone who is focusing on it and wants to get the job
done. GREAT!
I will give you my short analysis and see how many other questions I can
answer.
We have always known there was a way to do PITR with WAL, but WAL needed
a few extra pieces of information. Unfortunately, we weren't able to
analyze what was required. Seems you have gotten very far here, and
that is great.
Also, we thought about having PITR as part of replication (replay of
replication traffic logs) but having it tied to WAL is much cleaner and
has better performance, I bet.
I will do whatever I can to help. My chat addresses are:
AIM bmomjian
ICQ 151255111
Yahoo bmomjian
MSN root(at)candle(dot)pha(dot)pa(dot)us
IRC #postgresql via EFnet
---------------------------------------------------------------------------
J. R. Nield wrote:
> Hello:
>
> I've got the logging system to the point where I can take a shutdown
> consistent copy of a system, and play forward through multiple
> checkpoints. It seems to handle CREATE TABLE/DROP TABLE/TRUNCATE
Yes, that was always something we knew was lacking in the current WAL
contents.
> properly, and things are moving forward well. Recovery to an arbitrary
> point-in-time should be just as easy, but will need some administrative
> interface for it.
The administrative part can be done easily. We can get that part done.
It is the low-level stuff we always needed help with.
> At this point, some input would be useful on how I should handle things.
>
> The most important questions that need answering are in sections 2 & 5,
> since they impact the most other parts of the system. They will also
> require good documentation for sysadmins.
> ?0 - Introduction
>
> This file is where I'm keeping track of all the issues I run into while
> trying to get PITR to work properly. Hopefully it will evolve into a
> description of how PITR actually works once it is implemented.
>
> I will also try to add feedback as it comes in.
>
> The big items so-far are:
> ?1 - Logging Relation file creation, truncation, and removal
> This is mostly done. Can do infinite play-forward from
> online logs.
Excellent!
> ?2 - Partial-Write and Bad Block detection
> Need input before starting. Migration issues.
Uh, we already log pre-image page writes to WAL to recover from partial
page writes to disk. Is there something more we need here?
As for bad block detection, we have thought about adding a CRC to each
page header, or at least making it optional. WAL already has a CRC.
> ?3 - Detecting Shutdown Consistent System Recovery
> Mostly done.
> ?4 - Interactive Play-Forward Recovery for an Entire System
> Need input before starting.
You mean user interface? My idea would be to just get some command-line
tool working and we can write some GUI app to manage it and use the
command-line tool as an interface into the system.
> ?5 - Individual file consistent recovery
> Need input. Semi-Major changes required.
>
OK, here are the specific questions. Got it.
> ?1 - Logging Relation file creation, truncation, and removal
>
> ?1.1 - Problem:
>
> Without file creation in the log, we can't replay committed
> transactions that create relations.
>
> The current code assumes that any transaction reaching commit has already
> ensured its files exist, and that those files will never be removed. This
> is true now, but not for log-replay from an old backup database system.
>
> The current XLOG code silently ignores block-write requests for
> non-existent files, and assumes that the transaction generating those
> requests must have aborted.
>
> Right now a crash during TRUNCATE TABLE will leave the table in an
> inconsistent state (partially truncated). This would not work when doing
> replay from before the last checkpoint.
Yes, there are a few places where we actually create a file, and if the
server crashes, the file remains out there forever. We need to track
that better. I think we need some utility that compares pg_class with
the files in the directory and cleans out unused files on server
startup. I started working on such code as part of VACUUM, but it had
to make too many assumptions because other backends were working at the
same time. On recovery, you don't have that problem, so you can do
little more than an 'ls' and clean out the files left over from the
crash. Seems that would solve several of those problems.
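Something along these lines is what I have in mind, just a rough sketch;
the relfilenode lookup is a stand-in for whatever catalog scan the real
tool would do:

    /* Rough sketch only: at recovery startup, remove relation files that
     * no longer have a pg_class entry.  is_known_relfilenode() is a
     * placeholder for the real catalog lookup. */
    #include <dirent.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static bool
    is_known_relfilenode(const char *name)
    {
        (void) name;                /* placeholder: consult pg_class here */
        return true;
    }

    static void
    cleanup_orphan_files(const char *datadir)
    {
        DIR            *dir = opendir(datadir);
        struct dirent  *de;

        if (dir == NULL)
            return;
        while ((de = readdir(dir)) != NULL)
        {
            /* relation files are named by their oid/relfilenode: all digits */
            if (de->d_name[0] == '\0' ||
                strspn(de->d_name, "0123456789") != strlen(de->d_name))
                continue;
            if (!is_known_relfilenode(de->d_name))
            {
                char        path[4096];

                snprintf(path, sizeof(path), "%s/%s", datadir, de->d_name);
                printf("removing orphaned file %s\n", path);
                unlink(path);
            }
        }
        closedir(dir);
    }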
>
> ?1.1.1 - CREATE DATABASE is also unlogged
>
> This will cause the same replay problems as above.
Yep. Again, seems a master cleanup on startup is needed.
> ?1.2 - Proposal:
>
> a) Augment the SMGR code to log relation file operations, and to handle
> redo requests properly. This is simple in the case of create. Drop must be
> logged only IN the commit record. For truncate see (b).
Yep, we knew we needed that.
> The 'struct f_smgr' needs new operations 'smgr_recreate', 'smgr_reunlink',
> and 'smgr_retruncate'. smgr_recreate should accept a RelFileNode instead
> of a Relation.
No problem. Clearly required.
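For what it's worth, here is roughly how I picture the table growing.
The signatures and the RelFileNode layout below are made up for
illustration; the real ones live in the smgr headers:

    /* Illustrative only; actual signatures may differ. */
    typedef struct RelFileNode
    {
        unsigned    tblNode;        /* database */
        unsigned    relNode;        /* relation */
    } RelFileNode;

    typedef struct f_smgr
    {
        /* ... existing entries: smgr_create, smgr_unlink, smgr_truncate ... */
        void    (*smgr_recreate) (RelFileNode rnode);       /* redo of create */
        void    (*smgr_reunlink) (RelFileNode rnode);       /* redo of unlink */
        void    (*smgr_retruncate) (RelFileNode rnode,
                                    unsigned nblocks);      /* redo of truncate */
    } f_smgr;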
> Transactions that abort through system failure (ie. unlogged aborts)
> will simply continue to leak files.
Yep, need a cleanup process on start.
> b) If TRUNCATE TABLE fails, the system must PANIC. Otherwise, the table
> may be used in a future command, and a replay-recovered database may
> end-up with different data than the original.
We number based on oids. You mean oid wraparound could cause the file
to be used again?
> WAL must be flushed before truncate as well.
>
> WAL does not need to be flushed before create, if we don't mind
> leaking files sometimes.
Cleanup?
> c) Redo code should treat writes to non-existent files as an error.
> Changes affect heap & nbtree AM's. [Check others]
Yep, once you log create/drop, if something doesn't match, it is an
error, while before, we could ignore it.
> d) rtree [and GiST? WTF is GiST? ] is not logged. A replay recovery of
> a database should mark all the rtree indices as corrupt.
> [ actually we should do that now, are we? ]
Known problem. Not sure what is being done. TODO has:
* Add WAL index reliability improvement to non-btree indexes
so it is a known problem, and we aren't doing anything about it. What
more can I say? ;-)
> e) CREATE DATABASE must be logged properly, not use system(cp...)
OK, should be interesting.
> ?1.3 - Status:
>
> All logged SMGR operations are now in a START_CRIT_SECTION()/
> END_CRIT_SECTION() pair enclosing the XLogInsert() and the underlying fs
> operations.
>
> Code has been added to smgr and xact modules to log:
> create (no XLogFlush)
> truncate (XLogFlush)
> pending deletes on commit record
> files to delete on abort record
>
> Code added to md.c to support redo ops
>
> Code added to smgr for RMGR redo/desc callbacks
>
> Code added to xact RMGR callbacks for redo/desc
>
> Database will do infinite shutdown consistent system recovery from the
> online logs, if you manually munge the control file to set state ==
> DB_IN_PRODUCTION instead of DB_SHUTDOWNED.
Wow, how did you get so far?
> Still need to do:
> Item (c), recovery cleanup in all AM's
> Item (d), logging in other index AM's
> Item (e), CREATE DATABASE stuff
>
>
>
> ?2 - Partial-Write and Bad Block detection
>
> ?2.1 - Problem:
>
> In order to protect against partial writes without logging pages
> twice, we need to detect partial pages in system files and report them
> to the system administrator. We also might want to be able to detect
> damaged pages from other causes, like memory corruption, OS errors,
> etc. or in the case where the disk doesn't report bad blocks, but
> returns bad data.
Interesting. We just had a discussion about MSSQL's torn page bits on
every 512-byte block, which are set to the same value before the write.
On recovery, if any of the bits in a block differ, they recommend
recovering using PITR. We don't have PITR (yet), so there was no need
to implement it (we just logged whole pages to WAL before writing). I
think we may go with just a CRC per page for partial-write detection.
> We should also decide what should happen when a file is marked as
> containing corrupt pages, and requires log-archive recovery from a
> backup.
We can offer the option of not writing pre-image pages to WAL, which
will require PITR when a partial write is detected, or they can keep
the existing system with the performance hit.
>
> ?2.2 - Proposal:
>
> Add a 1 byte 'pd_flags' field to PageHeaderData, with the following
> flag definitions:
>
> PD_BLOCK_CHECKING (1)
> PD_BC_METHOD_BIT (1<<1)
>
> PageHasBlockChecking(page) ((page)->pd_flags & PD_BLOCK_CHECKING)
> PageBCMethodIsCRC64(page) ((page)->pd_flags & PD_BC_METHOD_BIT)
> PageBCMethodIsLSNLast(page) (!PageBCMethodIsCRC64(page))
>
> The last 64 bits of a page are reserved for use by the block checking
> code.
OK, so you already are on the CRC route.
> [ Is it worth the trouble to allow the last 8 bytes of a
> page to contain data when block checking is turned off for a Page?
> This proposal does not allow that. ]
You can leave the 8 bytes empty if there is no CRC. You may want to be
able to turn CRC on/off without a dump/reload.
> If the block checking method is CRC64, then that field will contain
> the CRC64 of the block computed at write time.
Cool.
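Just to make sure I follow, here is a bare-bones sketch of the
write/read checksum cycle I have in mind. The polynomial and layout are
invented for illustration, not a proposal for the real page format:

    /* Sketch: an 8k page whose last 8 bytes hold a CRC-64 computed over
     * the rest of the page at write time and re-checked at read time. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE   8192
    #define CRC64_POLY  0x42F0E1EBA9EA3693ULL

    static uint64_t
    crc64(const unsigned char *buf, size_t len)
    {
        uint64_t    crc = 0;

        for (size_t i = 0; i < len; i++)
        {
            crc ^= (uint64_t) buf[i] << 56;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & (1ULL << 63)) ? (crc << 1) ^ CRC64_POLY : crc << 1;
        }
        return crc;
    }

    /* stamp the checksum into the last 8 bytes before the page is written */
    static void
    page_set_checksum(unsigned char *page)
    {
        uint64_t    crc = crc64(page, PAGE_SIZE - sizeof(uint64_t));

        memcpy(page + PAGE_SIZE - sizeof(uint64_t), &crc, sizeof(uint64_t));
    }

    /* recompute and compare on read; a mismatch means a torn or corrupt page */
    static int
    page_checksum_ok(const unsigned char *page)
    {
        uint64_t    stored;

        memcpy(&stored, page + PAGE_SIZE - sizeof(uint64_t), sizeof(uint64_t));
        return stored == crc64(page, PAGE_SIZE - sizeof(uint64_t));
    }

    int
    main(void)
    {
        unsigned char page[PAGE_SIZE] = {0};

        page_set_checksum(page);
        page[100] ^= 0xFF;          /* simulate a partial/corrupt write */
        printf("page ok? %d\n", page_checksum_ok(page));
        return 0;
    }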
> If the block checking method is LSNLast, then the field contains a
> duplicate of the pd_lsn field.
>
> ?2.2.1 - Changes to Page handling routines
>
> All the page handling routines need to understand that
> pd_special == (pd_special - (specialSize + 8))
>
> Change header comment in bufpage.h to reflect this.
Yes, we should add a format version to the heap page tail anyway like
btree has, i.e. some constant on every page that describes the format
used in that PostgreSQL version.
> ?2.2.2 - When Reading a Page
>
> Block corruption is detected on read in the obvious way with CRC64.
>
> In the case of LSNLast, we check to see if pd_lsn == the lsn in the
> last 64 bits of the page. If not, we assume the page is corrupt from
> a partial write (although it could be something else).
LSN?
> IMPORTANT ASSUMPTION:
> The OS/disk device will never write both the first part and
> last part of a block without writing the middle as well.
> This might be wrong in some cases, but at least it's fast.
>
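And for the LSN-last method, I assume the check is basically just this
(field names invented):

    /* The page is presumed intact if the LSN in the page header matches
     * the copy kept in the page's last 8 bytes. */
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 8192

    typedef uint64_t Lsn;

    static int
    page_lsnlast_ok(const unsigned char *page, Lsn header_lsn)
    {
        Lsn         tail_lsn;

        memcpy(&tail_lsn, page + PAGE_SIZE - sizeof(Lsn), sizeof(Lsn));
        return tail_lsn == header_lsn;
    }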
> ?2.2.4 - GUC Variables
>
> The user should be able to configure what method is used:
>
> block_checking_write_method = [ checksum | torn_page_flag | none ]
>
> Which method should be used for blocks we write?
Do we want the torn page flag? Seems like a pain to get that on every
512-byte section of the 8k page.
> check_blocks_on_read = [ true | false ]
>
> When true, verify that the blocks we read are not corrupt, using
> whatever method is in the block header.
>
> When false, ignore the block checking information.
>
Good idea. We always check on crash, but check on read only when set.
Good for detecting hardware problems.
> ?2.3 - Status:
>
> Waiting for input from pgsql-hackers.
>
> Questions:
>
> Should we allow the user to have more detailed control over
> which parts of a database use block checking?
I don't think that is needed; installation-wide settings are fine.
> For example: use 'checksum' on all system catalogs in all databases,
> 'torn_page_flag' on the non-catalog parts of the production database,
> and 'none' on everything else?
Too complicated. Let's get it implemented and in the field and see what
people ask for.
> ?3 - Detecting Shutdown Consistent System Recovery
>
> ?3.1 - Problem:
>
> How to notice that we need to do log-replay for a system backup, when the
> restored control file points to a shutdown checkpoint record that is
> before the most recent checkpoint record in the log, and may point into
> an archived file.
>
> ?3.2 - Proposal:
>
> At startup, after reading the ControlFile, scan the log directory to
> get the list of active log files, and find the lowest logId and
> logSeg of the files. Ensure that the files cover a contiguous range
> of LSN's.
>
> There are three cases:
>
> 1) ControlFile points to the last valid checkpoint (either
> checkPoint or prevCheckPoint, but one of them is the greatest
> valid checkpoint record in the log stream).
>
> 2) ControlFile points to a valid checkpoint record in an active
> log file, but there are more valid checkpoint records beyond
> it.
>
> 3) ControlFile points to a checkpoint record that should be in the
> archive logs, and is presumably valid.
>
> Case 1 is what we handle now.
>
> Cases 2 and 3 would result from restoring an entire system from
> backup in preparation to do a play-forward recovery.
>
> We need to:
>
> Detect cases 2 and 3.
>
> Alert the administrator and abort startup.
> [Question: Is this always the desired behavior? We can
> handle case 2 without intervention. ]
>
> Let the administrator start a standalone backend, and
> perform a play-forward recovery for the system.
>
> ?3.3 - Status:
>
> In progress.
Sorry, I was confused by this.
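To make sure I understand the three cases, I think the startup check
boils down to something like this (names and types invented; it assumes
we already read pg_control and scanned the active log files):

    #include <stdint.h>

    typedef uint64_t Lsn;

    enum startup_case
    {
        NORMAL_STARTUP,             /* case 1: what we handle today */
        REPLAY_FROM_ACTIVE_LOGS,    /* case 2: newer checkpoints in pg_xlog */
        NEED_ARCHIVED_LOGS          /* case 3: checkpoint is in the archive */
    };

    static enum startup_case
    classify_startup(Lsn control_ckpt,      /* from the control file */
                     Lsn newest_ckpt,       /* greatest checkpoint found in logs */
                     Lsn oldest_on_disk)    /* lowest LSN covered by active logs */
    {
        if (control_ckpt < oldest_on_disk)
            return NEED_ARCHIVED_LOGS;
        if (control_ckpt < newest_ckpt)
            return REPLAY_FROM_ACTIVE_LOGS;
        return NORMAL_STARTUP;
    }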
> ?4 - Interactive Play-Forward Recovery for an Entire System
>
> Play-Forward File Recovery from a backup file must be interactive,
> because not all log files that we need are necessarily in the
> archive directory. It may be possible that not all the archive files
> we need can even fit on disk at one time.
>
> The system needs to be able to prompt the system administrator to feed
> it more log files.
>
> TODO: More here
Yes, we can have someone working on the GUI once the command-line
interface is defined.
> ?5 - Individual file consistent recovery
>
> ?5.1 - Problem:
>
> If a file detects corruption, and we restore it from backup, how do
> we know what archived files we need for recovery?
>
> Should file corruption (partial write, bad disk block, etc.) outside
> the system catalog cause us to abort the system, or should we just
> take the relation or database off-line?
Offline is often best so they can get in there and recover if needed.
We usually allow them in with a special flag or utility like
pg_resetxlog.
> Given a backup file, how do we determine the point in the log
> where we should start recovery for the file? What is the highest LSN
> we can use that will fully recover the file?
That is tricky. We have discussed it and your backup has to deal with
some pretty strange things that can happen while 'tar' is traversing the
directory.
> ?5.2 - Proposal:
>
> Put a file header on each file, and update that header to the last
> checkpoint LSN at least once every 'file_lsn_time_slack' minutes, or
> at least once every dbsize/'file_lsn_log_slack' megabytes of log
> written, where dbsize is the estimated size of the database. Have
> these values be settable from the config file. These updates would be
> distributed throughout the hour, or interspersed between regular
> amounts of log generation.
>
> If we have a database backup program or command, it can update the
> header on the file before backup to the greatest value it can assure
> to be safe.
Not sure.
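If we do go the file-header route, though, I imagine it being something
as simple as this (layout purely illustrative):

    #include <stdint.h>

    typedef uint64_t Lsn;

    typedef struct FileRecoveryHeader
    {
        uint32_t    magic;              /* identifies the header format */
        uint32_t    version;            /* bump if the layout changes */
        Lsn         lower_bound_lsn;    /* recovery of a backup of this file
                                         * may safely start at this LSN */
    } FileRecoveryHeader;

    /* called at most every file_lsn_time_slack minutes, or after roughly
     * dbsize / file_lsn_log_slack megabytes of log have been written */
    static void
    refresh_file_header(FileRecoveryHeader *hdr, Lsn last_checkpoint_lsn)
    {
        if (last_checkpoint_lsn > hdr->lower_bound_lsn)
            hdr->lower_bound_lsn = last_checkpoint_lsn;
    }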
> ?5.3 - Status:
>
> Waiting for input from pgsql-hackers.
>
> Questions:
>
> There are alternate methods than using a file header to get a
> known-good LSN lower bound for the starting point to recover a backup
> file. Is this the best way?
Not sure.
I am sure others will chime in with more information.
--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026