RE: WAL documentation

From: "Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au>
To: <pgsql-hackers(at)postgresql(dot)org>
Cc: <postgresql-docs(at)postgresql(dot)org>
Subject: RE: WAL documentation
Date: 2001-01-24 05:52:53
Message-ID: NEBBIODEHDOLHLJPJCDDGEBGCAAA.chriskl@familyhealth.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Also, what happens with the size of the WAL logs? Do they just grow forever eventually filling up your hard drive, or should they reach a stable point where they tend not to grow any further?

ie. Will we sysadmins have to put cron jobs in to tar/gz old WAL logs or what???

Chris

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org]On Behalf Of Barry Lind
> Sent: Wednesday, January 24, 2001 12:32 PM
> To: pgsql-hackers(at)postgresql(dot)org
> Cc: postgresql-docs(at)postgresql(dot)org
> Subject: Re: [HACKERS] WAL documentation
>
>
> Not knowing much about WAL, but understanding a good deal about Oracle's
> logs, I read the WAL documentation below. While it is good, after
> reading it I am still left with a couple of questions and therefore
> believe the doc could be improved a bit.
>
> The two questions I am left with after reading the WAL doc are:
>
> 1) In the 'WAL Parameters' section, paragraph 3 there is the following
> sentence:
> "After a checkpoint has been made, any log segments written before the
> redo record may be removed/archived..." What does the 'may' refer
> mean? Does the database administrator need to go into the directory and
> remove the no longer necessary log files? What does archiving have to
> do with this? If I archived all log files, could I roll forward a
> backup made previously? That is the only reason I can think of that you
> would archive log files (at least that is why you archive log files in
> Oracle).
>
> 2) The doc doesn't seem to explain how on database recovery the database
> knows which log file to start with. I think walking through an example
> of how after a database crash, the log file is used for recovery, would
> be useful. At least it would make me as a user of postgres feel better
> if I understood how crashes are recovered from.
>
> thanks,
> --Barry
>
>
>
>
> Oliver Elphick wrote:
> >
> > Here is documentation for WAL, as text for immediate review and as SGML
> > source, generated from Vadim's original text with my editing.
> >
> > Please review for correctness.
> >
> > =========================== WAL chapter ==========================
> >
> > Write-Ahead Logging (WAL) in Postgres
> >
> > Author: Written by Vadim Mikheev and Oliver Elphick.
> >
> > General description
> >
> > Write Ahead Logging (WAL) is a standard approach to transaction logging.
> > Its detailed description may be found in most (if not all) books about
> > transaction processing. Briefly, WAL's central concept is that
> changes to
> > data files (where tables and indices reside) must be written only after
> > those changes have been logged - that is, when log records have been
> > flushed to permanent storage. When we follow this procedure, we do not
> > need to flush data pages to disk on every transaction commit, because we
> > know that in the event of a crash we will be able to recover
> the database
> > using the log: any changes that have not been applied to the data pages
> > will first be redone from the log records (this is roll-forward
> recovery,
> > also known as REDO) and then changes made by uncommitted transactions
> > will be removed from the data pages (roll-backward recovery - UNDO).
> >
> > Immediate benefits of WAL
> >
> > The first obvious benefit of using WAL is a significantly reduced number
> > of disk writes, since only the log file needs to be flushed to disk at
> > the time of transaction commit; in multi-user environments, commits of
> > many transactions may be accomplished with a single fsync() of the log
> > file. Furthermore, the log file is written sequentially, and so the cost
> > of syncing the log is much less than the cost of syncing the data pages.
> >
> > The next benefit is consistency of the data pages. The truth is that,
> > before WAL, PostgreSQL was never able to guarantee consistency in the
> > case of a crash. Before WAL, any crash during writing could result in:
> >
> > 1. index tuples pointing to non-existent table rows;
> > 2. index tuples lost in split operations;
> > 3. totally corrupted table or index page content, because of
> > partially written data pages.
> >
> > (Actually, the first two cases could even be caused by use of
> the "pg_ctl
> > -m {fast | immediate} stop" command.) Problems with indices (problems
> > 1 and 2) might have been capable of being fixed by additional fsync()
> > calls, but it is not obvious how to handle the last case without WAL;
> > WAL saves the entire data page content in the log if that is required
> > to ensure page consistency for after-crash recovery.
> >
> > Future benefits
> >
> > In this first release of WAL, UNDO operation is not implemented, because
> > of lack of time. This means that changes made by aborted transactions
> > will still occupy disk space and that we still need a permanent pg_log
> > file to hold the status of transactions, since we are not able to re-use
> > transaction identifiers. Once UNDO is implemented, pg_log will
> no longer
> > be required to be permanent; it will be possible to remove pg_log at
> > shutdown, split it into segments and remove old segments.
> >
> > With UNDO, it will also be possible to implement SAVEPOINTs to allow
> > partial rollback of invalid transaction operations (parser errors caused
> > by mistyping commands, insertion of duplicate primary/unique keys and
> > so on) with the ability to continue or commit valid operations made by
> > the transaction before the error. At present, any error will invalidate
> > the whole transaction and require a transaction abort.
> >
> > WAL offers the opportunity for a new method for database on-line backup
> > and restore (BAR). To use this method, one would have to make periodic
> > saves of data files to another disk, a tape or another host and also
> > archive the WAL log files. The database file copy and the archived
> > log files could be used to restore just as if one were restoring after a
> > crash. Each time a new database file copy was made the old log
> files could
> > be removed. Implementing this facility will require the logging of data
> > file and index creation and deletion; it will also require
> development of
> > a method for copying the data files (O/S copy commands are not
> suitable).
> >
> > Implementation
> >
> > WAL is automatically enabled from release 7.1 onwards. No action is
> > required from the administrator with the exception of ensuring that the
> > additional disk-space requirements of the WAL logs are met, and that
> > any necessary tuning is done (see below).
> >
> > WAL logs are stored in $PGDATA/pg_xlog, as a set of segment files, each
> > 16Mb in size. Each segment is divided into 8Kb pages. The log record
> > headers are described in access/xlog.h; record content is
> dependent on the
> > type of event that is being logged. Segment files are given sequential
> > numbers as names, starting at 0000000000000000. The numbers do
> not wrap,
> > at present, but it should take a very long time to exhaust the available
> > stock of numbers.
> >
> > The WAL buffers and control structure are in shared memory, and are
> > handled by the backends; they are protected by spinlocks. The demand
> > on shared memory is dependent on the number of buffers; the default size
> > of the WAL buffers is 64Kb.
> >
> > It is desirable for the log to be located on another disk than the main
> > database files. This may be achieved by moving the directory, pg_xlog,
> > to another filesystem (while the postmaster is shut down, of course)
> > and creating a symbolic link from $PGDATA to the new location.
> >
> > The aim of WAL, to ensure that the log is written before database
> > records are altered, may be subverted by disk drives that falsely report
> > a successful write to the kernel, when, in fact, they have only cached
> > the data and not yet stored it on the disk. A power failure in such a
> > situation may still lead to irrecoverable data corruption;
> administrators
> > should try to ensure that disks holding PostgreSQL's data and log files
> > do not make such false reports.
> >
> > WAL parameters
> >
> > There are several WAL-related parameters that affect database
> > performance. This section explains their use.
> >
> > There are two commonly used WAL functions - LogInsert and LogFlush.
> > LogInsert is used to place a new record into the WAL buffers in shared
> > memory. If there is no space for the new record, LogInsert will have to
> > write (move to OS cache) a few filled WAL buffers. This is undesirable
> > because LogInsert is used on every database low level modification
> > (for example, tuple insertion) at a time when an exclusive lock is held
> > on affected data pages and the operation is supposed to be as fast as
> > possible; what is worse, writing WAL buffers may also cause the creation
> > of a new log segment, which takes even more time. Normally, WAL buffers
> > should be written and flushed by a LogFlush request, which is made,
> > for the most part, at transaction commit time to ensure that transaction
> > records are flushed to permanent storage. On systems with high
> log output,
> > LogFlush requests may not occur often enough to prevent WAL buffers'
> > being written by LogInsert. On such systems one should increase the
> > number of WAL buffers by modifying the "WAL_BUFFERS" parameter. The
> > default number of WAL buffers is 8. Increasing this value will have an
> > impact on shared memory usage.
> >
> > Checkpoints are points in the sequence of transactions at which it is
> > guaranteed that the data files have been updated with all information
> > logged before the checkpoint. At checkpoint time, all dirty data pages
> > are flushed to disk and a special checkpoint record is written to the
> > log file. As result, in the event of a crash, the recoverer knows from
> > what record in the log (known as the redo record) it should start the
> > REDO operation, since any changes made to data files before that record
> > are already on disk. After a checkpoint has been made, any log segments
> > written before the redo record may be removed/archived, so checkpoints
> > are used to free disk space in the WAL directory. The checkpoint maker
> > is also able to create a few log segments for future use, so as to avoid
> > the need for LogInsert or LogFlush to spend time in creating them.
> >
> > The WAL log is held on the disk as a set of 16Mb files called segments.
> > By default a new segment is created only if more than 75% of the current
> > segment is used. One can instruct the server to create up to 64 log
> > segments at checkpoint time by modifying the "WAL_FILES" parameter.
> >
> > For faster after-crash recovery, it would be better to create
> checkpoints
> > more often. However, one should balance this against the cost
> of flushing
> > dirty data pages; in addition, to ensure data page consistency,the first
> > modification of a data page after each checkpoint results in logging
> > the entire page content, thus increasing output to log and the
> log's size.
> >
> > By default, the postmaster spawns a special backend process to
> create the
> > next checkpoint 300 seconds after the previous checkpoint's creation.
> > One can change this interval by modifying the "CHECKPOINT_TIMEOUT"
> > parameter. It is also possible to force a checkpoint by using the SQL
> > command, CHECKPOINT.
> >
> > Setting the "WAL_DEBUG" parameter to any non-zero value will result in
> > each LogInsert and LogFlush WAL call's being logged to standard error.
> > At present, it makes no difference what the non-zero value is.
> >
> > The "COMMIT_DELAY" parameter defines for how long the backend will be
> > forced to sleep after writing a commit record to the log with LogInsert
> > call but before performing a LogFlush. This delay allows other backends
> > to add their commit records to the log so as to have all of them flushed
> > with a single log sync. Unfortunately, this mechanism is not fully
> > implemented at release 7.1, so there is at present no point in changing
> > this parameter from its default value of 5 microseconds.
> >
> > ===================== CHECKPOINT manual page ======================
> >
> > CHECKPOINT -- Forces a checkpoint in the transaction log
> >
> > Synopsis
> >
> > CHECKPOINT
> >
> > Inputs
> >
> > None
> >
> > Outputs
> >
> > CHECKPOINT
> >
> > This signifies that a checkpoint has been placed into the
> transaction log.
> >
> > Description
> >
> > Write-Ahead Logging (WAL) puts a checkpoint in the log every 300 seconds
> > by default. (This may be changed by the parameter CHECKPOINT_TIMEOUT
> > in postgresql.conf.)
> >
> > The CHECKPOINT command forces a checkpoint at the point at which the
> > command is issued. The next automatic checkpoint will happen the default
> > time after the forced checkpoint.
> >
> > Restrictions
> >
> > Use of the CHECKPOINT command is restricted to users with administrative
> > access.
> >
> > Usage
> >
> > To force a checkpoint in the transaction log:
> >
> > CHECKPOINT;
> >
> > Compatibility
> >
> > SQL92
> >
> > CHECKPOINT is a Postgres language extension. There is no CHECKPOINT
> > command in SQL92.
> >
> > Note: The handling of database storage and logging is a matter that the
> > standard leaves to the implementation.
> >
> >
> ------------------------------------------------------------------------
> > Name: checkpoint.sgml
> > checkpoint.sgml Type: text/x-sgml
> > Description: checkpoint.sgml
> >
> > Name: wal.sgml
> > wal.sgml Type: text/x-sgml
> > Description: wal.sgml
> >
> > Part 1.4Type: Plain Text (text/plain)
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2001-01-24 05:57:08 Re: This script will crash the connection
Previous Message The Hermit Hacker 2001-01-24 05:31:51 Re: beta4 ... almost time to wrap one ...