From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Simon Riggs <simon(at)2ndquadrant(dot)com> |
Cc: | Martin Scholes <marty(at)iicolo(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Kings-Lynne <chris(dot)kings-lynne(at)calorieking(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: WAL Bypass for indexes |
Date: | 2006-04-03 13:55:41 |
Message-ID: | 7498.1144072541@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> Thinking about this some more, I ask myself: why is it we log index
> inserts at all? We log heap inserts, which contain all the information
> we need to replay all index inserts also, so why bother?
(1) We can't run user-defined functions during log replay. Quite
aside from any risk of nondeterminism, the normal transaction
infrastructure isn't functioning in that environment.
(2) Some of the index code is itself deliberately nondeterministic.
I'm thinking in particular of the move-right-or-not choice in
_bt_insertonpg() when there are many equal keys, but randomization is
in general a useful algorithmic technique that we'd have to forswear.
(3) In the presence of concurrency, the sequence of heap-insert WAL
records isn't enough info, because it doesn't tell you what order the
index inserts occurred in. The btree code, at least, is sufficiently
concurrent that even knowing the sequence of leaf-key insertions isn't
full information --- it's not hard to imagine cases where decisions
about where to split upper-level pages are dependent on which process
manages to obtain lock on a page first.
There are probably some other reasons that I forgot. Check the
archives; this point has been debated before.
Basically the problem here is that you can't mix logged and non-logged
operations --- if you're going to WAL-log any operations on an index
then you have to be sure that the replay will regenerate exactly the
same series of index states that happened the first time. So none of
this is an argument against "rebuild the index at end of replay"; but
I don't see any workable half measures.
regards, tom lane
From | Date | Subject | |
---|---|---|---|
Next Message | Simon Riggs | 2006-04-03 14:28:50 | Re: WAL Bypass for indexes |
Previous Message | Horváth Sándor | 2006-04-03 13:48:47 | deferrable check, trigger |