From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
Cc: | pgsql-hackers-win32(at)postgresql(dot)org |
Subject: | Re: Can someone verify CVS tip on Win32? |
Date: | 2004-11-18 00:40:26 |
Message-ID: | 5912.1100738426@sss.pgh.pa.us |
Lists: | pgsql-hackers-win32 |
Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> Hmm ... I have a theory about it, but I'm not sure how to reproduce the
>> problem. How many databases have you created in the installation that
>> the contrib installcheck is running against?
> Just what make installcheck / make contrib installcheck runs.
OK. I still haven't been able to reproduce it, but the place where it
is failing is consistent with my theory, which is:
1. CREATE DATABASE creates a pg_database row for "regression" that is
the last or nearly last row that will fit into block 0 of pg_database.
It then flushes this block to disk to ensure that new backends can see
the row in GetRawDatabaseInfo.
2. pg_regress.sh then does several ALTER DATABASE operations. Each of
these marks the current row dead and makes a new row. At the end of this,
I hypothesize that the live copy of the "regression" row is in
pg_database block 1, not block 0. And it's not been flushed to disk,
because ALTER DATABASE fails to do that.
3. (Here's the hard-to-reproduce part.) Assume that something causes
block 0, but not block 1, of pg_database to be flushed from shared
buffers to disk.
4. Now, an incoming backend will see the original pg_database row for
"regression" as committed dead, so it'll ignore it. It can't see the
live row because that's not been flushed to disk; it's only in shared
buffers. Ergo, GetRawDatabaseInfo fails.
The problem goes away as soon as a checkpoint happens, since that writes
block 1 to disk too, but until then the regression tests can fail this way.
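
To make the sequence concrete, here is a toy model of steps 1 through 4 and of the checkpoint that clears the problem. It is only a sketch of the theory, not PostgreSQL code: the class `ToyCatalog`, its methods, and `run_scenario` are all hypothetical names, and the only thing it shares with the real system is the idea that a starting backend reads pg_database from disk rather than from shared buffers.

```python
# Toy model of the theory above; not PostgreSQL code. All names are hypothetical.

class ToyCatalog:
    """pg_database modeled as blocks held in shared buffers and on disk."""

    def __init__(self):
        self.buffers = {}  # block number -> list of [name, live] rows in memory
        self.disk = {}     # block number -> copy of the block as last written

    def insert(self, block, name):
        self.buffers.setdefault(block, []).append([name, True])

    def mark_dead(self, block, name):
        for row in self.buffers[block]:
            if row[0] == name:
                row[1] = False

    def flush(self, block):
        # Write one in-memory block out to disk.
        self.disk[block] = [list(row) for row in self.buffers[block]]

    def checkpoint(self):
        # Write every block out to disk.
        for block in self.buffers:
            self.flush(block)

    def raw_lookup(self, name):
        # Like a starting backend: look only at what is on disk.
        for block in sorted(self.disk):
            for row_name, live in self.disk[block]:
                if row_name == name and live:
                    return block
        return None


def run_scenario():
    cat = ToyCatalog()
    cat.insert(0, "regression")      # step 1: CREATE DATABASE adds the row ...
    cat.flush(0)                     # ... and flushes block 0
    seen_after_create = cat.raw_lookup("regression")

    cat.mark_dead(0, "regression")   # step 2: ALTER DATABASE kills the old row
    cat.insert(1, "regression")      # new row lands in block 1, never flushed

    cat.flush(0)                     # step 3: only block 0 gets written out
    seen_after_alter = cat.raw_lookup("regression")   # step 4: lookup fails

    cat.checkpoint()                 # a checkpoint writes block 1 as well
    seen_after_checkpoint = cat.raw_lookup("regression")
    return seen_after_create, seen_after_alter, seen_after_checkpoint
```

In this model the lookup finds the row in block 0 after the create, finds nothing after step 3, and finds it in block 1 once the checkpoint has run. If the step-3 flush is left out, the on-disk copy of block 0 still shows the old row as live, which is why the failure needs that extra write to show up.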
A reasonable theory about step 3 is that the bgwriter chooses to write
out block 0 at just the right time. This would happen infrequently
enough to explain why we've not seen this reported before.
This theory explains why the failure consistently happens at the same
place in the test sequence, and why that place is machine-architecture
dependent: it can only happen when a certain number of pg_database rows
have been created and deleted, and the magic number depends on the
machine MAXALIGN value because that affects the size of the rows.
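
A rough illustration of the MAXALIGN point, with made-up numbers: the row length `ROW_BYTES` is a hypothetical value chosen only to show the effect, and the calculation ignores page-header and item-pointer overhead. The 8192-byte block is PostgreSQL's default block size. The rounding in `max_align` (a name invented here) is the usual round-up-to-a-power-of-two trick.

```python
# Illustrative arithmetic only; ROW_BYTES is a hypothetical row length.

BLOCK_BYTES = 8192   # PostgreSQL's default block size
ROW_BYTES = 172      # hypothetical unaligned length of a pg_database row


def max_align(length, alignment):
    """Round length up to the next multiple of alignment (a power of two)."""
    return (length + alignment - 1) & ~(alignment - 1)


def rows_per_block(row_bytes, alignment, block_bytes=BLOCK_BYTES):
    # Ignores page-header and item-pointer overhead that a real page also has.
    return block_bytes // max_align(row_bytes, alignment)
```

With these assumed numbers, `rows_per_block(ROW_BYTES, 4)` is 47 and `rows_per_block(ROW_BYTES, 8)` is 46, so the row that spills into block 1 arrives one row earlier on the 8-byte-aligned machine. The real counts will differ, but the dependence on alignment is the same.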
The fix of course is that ALTER DATABASE must flush pg_database to disk,
just as RENAME does.
regards, tom lane