From: | Simon Riggs <simon(at)2ndquadrant(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org |
Subject: | Re: Why we really need timelines *now* in PITR |
Date: | 2004-07-19 21:58:16 |
Message-ID: | 1090274296.28049.317.camel@stromboli |
Lists: | pgsql-hackers |
On Mon, 2004-07-19 at 19:33, Tom Lane wrote:
> I wrote:
> > I think there's really no way around the issue: somehow we've got to
> > keep some meta-history outside the $PGDATA area, if we want to do this
> > in a clean fashion.
>
> After further thought I think we can fix this stuff by creating a
> "history file" for each timeline. This will make recovery slightly more
> complicated but I don't think it would be any material performance
> problem. Here's how it goes:
Yes...I came to the conclusion that trying to avoid doing something like
DB2 does was just stubbornness on my part. We may as well use analogies
with other systems when they are available.
All of this is good. Two main areas of comments/questions, marked (**).
Timelines should be easy to understand for anybody who can follow a
HACKERS conversation anyhow :)
>
> * Timeline IDs are 32-bit ints with no particular semantic significance
> (that is, we do not assume timeline 3 is a child of 2, or anything like
> that). The actual parentage of a timeline has to be found by inspecting
> its history file.
>
OK... that's better. The nested idea doesn't read well the second time
through.
> * History files will be named by their timeline ID, say "00000042.history".
> They will be created in /pg_xlog whenever a new timeline is created
> by the act of doing a recovery to a point in time earlier than the end
> of existing WAL. When doing WAL archiving a history file can be copied
> off to the archive area by the existing archiver mechanism (ie, we'll
> make a .ready file for it as soon as it's written).
>
Need to check the archive code, which relies on file shape and length.
> * History files will be plain text (for human consumption) and will
> essentially consist of a list of parent timeline IDs in sequence.
> I envision adding the timeline split timestamp and starting WAL segment
> number too, but these are for documentation purposes --- the system
> doesn't need them. We may as well allow comments in there as well,
> so that the DBA can annotate the reasons for a PITR split to have been
> done. So the contents might look like
>
> # Recover from unintentional TRUNCATE
> 00000001 0000000A00142568 2005-05-16 12:34:15 EDT
> # Ex-assistant DBA dropped wrong table
> 00000007 0000002200005434 2005-11-17 18:44:44 EST
>
Or should there be a recovery_comment parameter in the recovery.conf?
That would be better than suggesting that admins can edit such an
important file. (Even if they can, it's best not to encourage it).
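
To make the proposed format concrete, here is a minimal Python sketch of
reading such a history file. It assumes timeline IDs are written as 8-digit
hex, which the thread does not state, and `parse_history` is a hypothetical
name, not anything in the PostgreSQL source. Only the first field of each
line is used, since the segment number and timestamp are described as being
for documentation only.

```python
def parse_history(text):
    """Return the parent timeline IDs listed in a history file, oldest first."""
    parents = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and DBA comments carry no machine meaning
        fields = line.split(None, 2)
        # Assumption: the ID is hex, matching a name like 00000042.history.
        parents.append(int(fields[0], 16))
    return parents


example = """\
# Recover from unintentional TRUNCATE
00000001 0000000A00142568 2005-05-16 12:34:15 EDT
# Ex-assistant DBA dropped wrong table
00000007 0000002200005434 2005-11-17 18:44:44 EST
"""
print(parse_history(example))  # [1, 7]
```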
> When we split off a new timeline, we just have to copy the parent's
> history file (which we can do verbatim including comments) and then
> add a new line at the end showing the immediate parent's timeline ID
> and the other details of the split. Initdb can create 00000001.history
> with empty contents (since that timeline has no parents).
Yes.
Will you then delete the previous timeline's history file or just leave
it there? (OK, you say that later)
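
Copying the parent's history file and appending one line could be sketched
as follows. `child_history` and its parameters are hypothetical; the
optional comment stands in for the recovery_comment idea floated earlier in
this reply, and the 8-digit hex formatting of the ID is an assumption.

```python
def child_history(parent_text, parent_tli, split_segment, split_time, comment=None):
    """Build the history file contents for a timeline branched off parent_tli."""
    # Keep the parent's file verbatim, including any comments in it.
    lines = parent_text.splitlines()
    if comment:
        lines.append("# " + comment)
    # One new line describing the immediate parent and the split point.
    lines.append("%08X %s %s" % (parent_tli, split_segment, split_time))
    return "\n".join(lines) + "\n"


first = child_history("", 1, "0000000A00142568", "2005-05-16 12:34:15 EDT",
                      comment="Recover from unintentional TRUNCATE")
print(first, end="")
```

Starting from an empty parent file (timeline 1 has no parents), the result
is just the comment line followed by the new parent line.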
> * When we need to do recovery, we first identify the source timeline
> (either by reading the current timeline ID from pg_control, or the DBA
> can tell us with a parameter in recovery.conf). We then read the
> history file for that timeline, and remember its sequence of parent
> timeline IDs. We can crosscheck that pg_control's timeline ID is
> one of this set of timeline IDs, too --- if it's not then the wrong
> backup file was restored.
** Surely it is the backup itself that determines the source timeline?
Backups are always taken in one particular timeline. The rollforward
must start at a checkpoint before the begin backup and roll past the end
of backup marker onwards. The starting checkpoint should be the last
checkpoint prior to backup - why would you pick another? That checkpoint
will always be in the current timeline, since we always come out of
startup with a checkpoint (either because we shut down earlier, or we
recovered and just wrote another shutdown checkpoint).
So the backup's timeline will determine the source timeline, but not
necessarily the target timeline.
...thinking... recovery.conf would need to specify:
  recovery_target (if there is one, either a time or txnid)
  recovery_target_timeline (if there is one, otherwise end of last one)
  recovery_target_history_file (which specifies how the timeline ids are
  sequenced)
I take it that your understanding is that the recovery_target timeline
needs to be specified also?
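
The crosscheck in the quoted text, read together with the point that the
backup fixes the source timeline but not the target, can be sketched as a
membership test. This is only an illustration: `backup_is_usable` and its
arguments are hypothetical names, and the parent list is assumed to have
been read from the target timeline's history file already.

```python
def backup_is_usable(backup_tli, target_tli, target_parents):
    """True if a backup taken on backup_tli can be rolled forward to target_tli.

    The backup qualifies when its timeline is the target itself or one of
    the target's ancestors; otherwise the wrong backup was restored.
    """
    return backup_tli == target_tli or backup_tli in target_parents


# Target timeline 9 descends from 1 and then 7.
print(backup_is_usable(7, 9, [1, 7]))  # True
print(backup_is_usable(4, 9, [1, 7]))  # False
```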
> * During recovery, whenever we need to open a WAL segment file, we first
> try to open it with the source timeline ID; if that doesn't exist, try
> the immediate parent timeline ID; then the grandparent, etc. Whenever
> we find a WAL file with a particular timeline ID, we forget about all
> parents further up in the history, and won't try to open their segments
> anymore (this is the generalization of my previous rule that you never
> drop down in timeline number as you scan forward).
>
This jigging around is OK, because most people will be using only one
timeline anyhow, so it's not likely to cause too much fuss for the user.
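
A minimal sketch of the segment lookup described in the quoted text. The
file name layout (8 hex digits of timeline ID followed by 16 hex digits of
segment number) is an assumption made for illustration, and `seg_name`,
`open_segment` and the `exists` callback are hypothetical names.

```python
def seg_name(tli, segno):
    # Assumed layout: timeline ID, then segment number, both in hex.
    return "%08X%016X" % (tli, segno)


def open_segment(segno, candidates, exists):
    """Find segment segno, trying timelines newest first.

    candidates is [source, parent, grandparent, ...]. Returns the file name
    found (or None) and the candidate list to use for the next segment.
    """
    for i, tli in enumerate(candidates):
        name = seg_name(tli, segno)
        if exists(name):
            # Found on tli: forget every ancestor older than tli.
            return name, candidates[: i + 1]
    return None, candidates


files = {seg_name(1, 5), seg_name(7, 6), seg_name(1, 7)}
cands = [9, 7, 1]
name5, cands = open_segment(5, cands, files.__contains__)
name6, cands = open_segment(6, cands, files.__contains__)
name7, cands = open_segment(7, cands, files.__contains__)
print(name5, name6, name7, cands)
```

In this run segment 7 exists only on timeline 1, but timeline 1 was dropped
from the candidates once segment 6 was found on timeline 7, so the lookup
for segment 7 returns None rather than stepping back down.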
> * If we end recovery because we have rolled forward off the end of WAL,
> we can just continue using the source timeline ID --- we are extending
> that timeline. (Thus, an ordinary crash and restart doesn't require
> generating a new timeline ID; nor do we generate a new line during
> normal postmaster stop/start.)
Yes, exactly - that's why it can't be the SUI.
> But if we stop recovery at a requested
> point-in-time earlier than end of WAL, we have to branch off a new
> timeline. We do this by:
> * Selecting a previously unused timeline ID (see below).
> * Writing a history file for this ID, by copying the parent
> timeline's history file and adding a new line at the end.
> * Copying the last-used WAL segment of the parent timeline,
> giving it the same segment number but the new timeline's ID.
> This becomes the active WAL segment when we start operating.
>
> * We can identify the highest timeline ID ever used by simply starting
> with the source timeline ID and probing pg_xlog and the archive area
> for history files N+1.history, N+2.history, etc until we find an ID
> for which there is no history file. Under reasonable scenarios this
> will not take very many probes, so it doesn't seem that we need any
> addition to the archiver API to make it more efficient.
** I would prefer to add a random number to the timeline as a way of
identifying the next one. This will produce fewer probes, so less wasted
tape mounts, but most importantly it gets round this issue:
You're on timeline X, then you recover and run for a while on timeline
Y. You then realise recovering to that target was a really bad idea for
some reason (some VIP's record wasn't in the recovered data, etc.). We then
need to re-recover from the backup on X to a new timeline, Z. But how
does X know that Y existed when it creates Z?
If Y = f(X) in a deterministic way, then Z will always equal Y. Of course,
we could provide an id, but what would you pick? The best way is to get
out of trouble by picking a new timeline id that's very unlikely to have
been picked before.
If the sequence of timeline ids is not important, just pick one from the
billions you have available to you (and that aren't mentioned in the
history file). We can do this automatically and pick it randomly.
That way, when you re-recover you stand a vanishingly small chance of
picking any timeline id that you (or indeed anyone!) have ever used.
This will be very important for diagnosing problems, and it is my
experience that the re-recovery scenario happens on about 50% of
recoveries. i.e. if you recover once, you're very likely to recover 2 or
more times before you're really done. (...and if you don't believe me,
look what happened to Danske Bank running DB2 - recovered 4 times inside
a week, but hats off to those guys - they got it back in the end).
But then - we also need to be able to identify which was the latest
history file and searching a billion files might take a while. So the
sequential numbering does serve a purpose. Both ideas solve only one of
the two problems....hmmm, I think perhaps finding latest file is more
important and so perhaps sequential numbering should win after all?
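
Both options can be sketched side by side. This is an illustration under
stated assumptions: history files are named with 8-digit hex IDs,
`history_exists` is a hypothetical callback standing in for a probe of
pg_xlog and the archive area, and the function names are invented here.

```python
import random


def next_timeline_id(source_tli, history_exists):
    """Sequential choice: probe N+1, N+2, ... until no history file exists."""
    candidate = source_tli + 1
    while history_exists("%08X.history" % candidate):
        candidate += 1
    return candidate


def random_timeline_id(used, rng):
    """Random choice: any 32-bit ID not already named in the history."""
    while True:
        candidate = rng.randrange(1, 2**32)
        if candidate not in used:
            return candidate


# X = 1. A first recovery created Y = 2 and its history file.
visible = {"00000002.history"}
z_seen = next_timeline_id(1, lambda name: name in visible)
z_blind = next_timeline_id(1, lambda name: False)
print(z_seen, z_blind)  # 3 2
```

With sequential numbering, the re-recovery from X picks a fresh Z only if
the probe can see Y's history file; when it cannot, Z collides with Y. The
random variant avoids that dependence but gives no cheap way to find the
latest history file.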
> * Since history files will be small and made infrequently (one hopes you
> do not need to do a PITR recovery very often...) I see no particular
> reason not to leave them in /pg_xlog indefinitely. The DBA can clean
> out old ones if she is a neatnik, but I don't think the system needs to
> or should delete them. Similarly the archive area could be expected to
> retain history files indefinitely.
>
OK. Answered question above...
Yes, agreed. We'll want them for diagnostics anyway.
> * However, you *can* throw away a history file once you are no longer
> interested in rolling back to times predating the splitoff point of the
> timeline. If we don't find a history file we can just act as though the
> timeline has no parents (extends indefinitely far in the past). (Hm,
> so we don't actually have to bother creating 00000001.history...)
>
Agreed. That's better: fewer files waiting around, less chance of them
being deleted by over-diligent admins.
But we shouldn't encourage the deletion of those files. The worst
problems happen when people "tidy up" after they think recovery is over,
then delete an important file and we're back in traction again.
> * I'm intending to replace the current concept of StartUpID (SUI) by
> timeline IDs --- we'll record timeline IDs not SUIs in data page headers
> and WAL page headers. SUI isn't doing anything of value for us; I think
> it was probably intended to do what timelines will do, but it's not
> defined quite right for the purpose. One good thing about timeline IDs
> for WAL page headers is that we know exactly which IDs should be
> expected in a WAL file (either the current timeline or one of its
> parents); this allows a much tighter check than is possible with SUIs.
>
Definitely agree on this last part; that stuff about 512 SUIs was weird.
> Anybody see any holes in this design?
>
As said already, all of this is good. Two main areas of
comments/questions, marked (**) above.
That's coherent and good.
Best regards, Simon Riggs