From: | Amit Kapila <amit(dot)kapila(at)huawei(dot)com> |
---|---|
To: | "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | "'Alvaro Herrera'" <alvherre(at)commandprompt(dot)com>, 'Cédric Villemain' <cedric(at)2ndquadrant(dot)com>, "'Pg Hackers'" <pgsql-hackers(at)postgresql(dot)org>, "'Robert Haas'" <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Allow WAL information to recover corrupted pg_controldata |
Date: | 2012-06-20 13:21:51 |
Message-ID: | 004701cd4ee7$a8139540$f83abfc0$@kapila@huawei.com |
Lists: | pgsql-hackers |
> I'm almost inclined to suggest that we not get next-LSN from WAL, but
> by scanning all the pages in the main data store and computing the max
> observed LSN. This is clearly not very attractive from a performance
> standpoint, but it would avoid the obvious failure mode where you lost
> some recent WAL segments along with pg_control.
According to my analysis, this approach has a problem.
I will explain it with an example scenario.
Example Scenario -
Assume the database crashes and can be recovered by normal crash recovery.
Now assume the data files and WAL files are intact and only the control file is lost.
The user then runs pg_resetxlog to regenerate pg_control, and we use the new algorithm (max LSN observed across the data pages) to generate next-LSN.
Summary of events before the database crash -
1. A checkpoint is in progress; it has already noted the next-LSN location (LSN-107) and marked the dirty pages as BM_CHECKPOINT_NEEDED.
2. At this point a new transaction dirties 2 pages: first it dirties a fresh page (this change gets LSN-108),
and then it dirties a page that is already marked as BM_CHECKPOINT_NEEDED (this change gets LSN-109).
3. The checkpoint starts flushing pages.
4. It flushes the page with LSN-109 but not the page with LSN-108, since the latter is not marked BM_CHECKPOINT_NEEDED.
5. The checkpoint finishes.
6. The database crashes.
Normal Crash Recovery -
It will start replay from 107, and after recovery the database will be in a consistent state.
pg_resetxlog -
It will generate next-LSN as 109, which when used for recovery will leave the database inconsistent, because the change at LSN-108 never reached disk and will not be replayed.
However, if we had relied on WAL, it would have got next-LSN as 107.
This is just an example case to show that generating next-LSN from pages can run into problems.
However, it does not prove that generating it from WAL will be correct.
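To make the scenario concrete, here is a toy model of it in Python. It is only a sketch and not PostgreSQL code: the names WalRecord, replay and max_page_lsn, the page names "A" and "B", and the old on-disk LSN 100 for page A are all assumptions made for illustration. Replay is modelled as "apply every WAL record at or after the start LSN whose change is not already on the page".

```python
from dataclasses import dataclass


@dataclass
class WalRecord:
    lsn: int      # position of the record in the (toy) WAL
    page: str     # page the record modifies
    value: str    # new page content after the change


def replay(disk_pages, wal, start_lsn):
    """Replay WAL from start_lsn; skip changes already present on a page."""
    pages = {name: dict(state) for name, state in disk_pages.items()}
    for rec in sorted(wal, key=lambda r: r.lsn):
        if rec.lsn < start_lsn:
            continue  # before the replay start point: never looked at
        page = pages[rec.page]
        if page["lsn"] < rec.lsn:
            page["lsn"] = rec.lsn
            page["value"] = rec.value
    return pages


def max_page_lsn(disk_pages):
    """The proposed algorithm: take the highest LSN seen on any data page."""
    return max(p["lsn"] for p in disk_pages.values())


# On-disk state after the crash: page B was flushed by the checkpoint
# (LSN 109), page A was dirtied at LSN 108 but never written out.
disk = {
    "A": {"lsn": 100, "value": "a-old"},
    "B": {"lsn": 109, "value": "b-new"},
}
wal = [WalRecord(108, "A", "a-new"), WalRecord(109, "B", "b-new")]

from_redo = replay(disk, wal, start_lsn=107)                 # normal crash recovery
from_pages = replay(disk, wal, start_lsn=max_page_lsn(disk))  # scan-based next-LSN

print(max_page_lsn(disk))
print(from_redo["A"]["value"], from_pages["A"]["value"])
```

In this model, replay from 107 brings page A up to date, while starting from the scanned maximum (109) leaves page A at its old content, which is the inconsistency described above.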
Please correct my understanding if I am wrong.
With Regards,
Amit Kapila.