From: "Florian G. Pflug" <fgp@phlo.org>
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: pgsql-hackers@postgresql.org
Subject: Re: Full page images in WAL & Cache Invalidation
Date: 2007-07-23 13:32:00
Message-ID: 46A4ADD0.8040500@phlo.org
Lists: pgsql-hackers
Tom Lane wrote:
> "Florian G. Pflug" <fgp(at)phlo(dot)org> writes:
>>> Anyway, if you believe that DDL is infrequent, why are you resistant
>>> to the idea of WAL-logging cache flushes?
>
>> First, cache invalidations are not the only problem caused by replaying
>> system-table updates. The whole SnapshotNow
>> business doesn't exactly make things easier too. So it feels like a
>> lot of added complexity and code for little gain - unless a *lot*
>> more things (like locking requests) are logged too.
>
> The mention of locking requests brought to mind the following
> gedankenexperiment:
>
> 1. slave server backend is running some long-running query on table X.
>
> 2. WAL-reading process receives and executes DROP TABLE X.
>
> (It doesn't even have to be a DROP; most varieties of ALTER are enough
> to create problems for a concurrently-running query.)
>
> It's really hard to see how to defend against that without a fairly
> complete simulation of locking on the slave side.
Well, locking on the slave is a bit easier than on the master, for
two reasons:
1) Queries running on the slave only lock in AccessShareLock mode -
   therefore, you only need to know if a certain mode conflicts with
   AccessShareLock - and there seems to be only one that does, namely
   AccessExclusiveLock. So we really only need to care about
   AccessExclusiveLock locks on the master (see the sketch below).
2) As far as I can see, the point of an AccessExclusiveLock is *not*
   actually to prevent queries from running while a DDL statement is
   *executed*, but rather to prevent queries from running while the
   statement is *committed*. This fits the fact that system tables are
   read using SnapshotNow (not SnapshotDirty) - while the DDL is
   running, everybody happily keeps using the old information; the
   trouble only starts after the commit, because with SnapshotNow you
   suddenly see the new state.
I'm not yet 100% sure that (2) holds (with the exception of VACUUM
FULL) - but I'm fairly confident, because if (2) were wrong, then how
would the system survive a crash during the execution of a DDL
statement?
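To make (1) concrete, here's a minimal sketch in C - the type and
function names are made up for illustration, this is not the real
lock.c conflict table:

#include <stdbool.h>
#include <stdio.h>

typedef enum LockMode {
    AccessShareLock,            /* plain SELECT */
    RowShareLock,
    RowExclusiveLock,
    ShareUpdateExclusiveLock,
    ShareLock,
    ShareRowExclusiveLock,
    ExclusiveLock,
    AccessExclusiveLock         /* DROP TABLE, most ALTERs, VACUUM FULL */
} LockMode;

/* Per the standard conflict matrix, AccessExclusiveLock is the only
 * mode that conflicts with AccessShareLock - so it's the only mode
 * the slave would ever have to replay. */
static bool
SlaveMustReplayLock(LockMode masterMode)
{
    return masterMode == AccessExclusiveLock;
}

int main(void)
{
    printf("AccessExclusiveLock relevant: %d\n",
           SlaveMustReplayLock(AccessExclusiveLock));   /* 1 */
    printf("RowExclusiveLock relevant:    %d\n",
           SlaveMustReplayLock(RowExclusiveLock));      /* 0 */
    return 0;
}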
So after a bit more thought (and reading - thanks for all the replies,
guys! They are greatly appreciated), I came up with the following plan
for both inval events and locks:
.) Store two flags in the commit record of a transaction, for
   "transaction generated inval events" and "transaction held an
   access exclusive lock".
.) Upon replay, block until no transactions are running (for
   "transaction held an access exclusive lock") before replaying the
   record, or flush the caches after replaying it (for
   "transaction generated inval events").
This scheme has two really nice properties:
First, it can be extended fairly easily to not store a simple flag, but
a list of OIDs, and use that to make the flushing and blocking more
fine-grained - without changing the basic way in which things work.
And second, it stores all the information needed in the *commit*
record. That avoids problems with transactions that are aborted by a
crash, and therefore never get an abort record written to the WAL.
VACUUM FULL will need some special care - but at least VACUUM FULL is
already such a disruptive operation that it probably won't surprise
anyone if it's disruptive on the slave too. (And now that CLUSTER is
MVCC-safe, from what I read, the use case for VACUUM FULL seems to
be pretty slim.)
The btree metapage caching will need special treatment too - probably
a flag in the WAL records that change the metapage, instructing the
slave to synthesize a suitable inval event.
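Roughly like this, say (the flag name and helper are invented for
illustration):

#include <stdint.h>

#define BTREE_XLOG_META_CHANGED 0x01    /* hypothetical flag bit */

/* Assumed helper: discard the slave's cached metapage for this index. */
extern void InvalidateIndexMetaCache(uint32_t indexOid);

static void
ReplayBtreeRecord(uint8_t info, uint32_t indexOid)
{
    if (info & BTREE_XLOG_META_CHANGED)
        InvalidateIndexMetaCache(indexOid);  /* synthesized inval event */
    /* ... then replay the page changes themselves ... */
}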
What stays open is regular VACUUM (and maybe inline vacuuming - is that
part of the latest version of HOT, or did it get removed?). Here, the
key is logging the return value of GetOldestXmin(), I think. Since that
value is what ultimately decides whether a tuple can be killed or not,
having it available on the slave should allow the slave to block replay
until no slave query depends on the tuple anymore.
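The replay-side check might look roughly like this (helper names are
invented, and real code would need a wraparound-aware XID comparison
such as TransactionIdPrecedes() rather than plain "<"):

#include <stdint.h>

typedef uint32_t TransactionId;

/* Assumed helpers, not actual PostgreSQL functions: */
extern TransactionId SlaveOldestRunningXmin(void);      /* oldest xmin among slave queries */
extern void WaitForQueriesOlderThan(TransactionId xmin);
extern void ApplyVacuumCleanup(void *record);

static void
ReplayVacuumCleanup(void *record, TransactionId masterOldestXmin)
{
    /* Tuples removed by this record were dead to every snapshot with
     * xmin >= masterOldestXmin on the master. A slave query with an
     * older xmin might still need them, so wait for it to finish. */
    if (SlaveOldestRunningXmin() < masterOldestXmin)
        WaitForQueriesOlderThan(masterOldestXmin);

    ApplyVacuumCleanup(record);
}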
greetings, Florian Pflug