From: | Greg Smith <greg(at)2ndQuadrant(dot)com> |
---|---|
To: | Greg Stark <stark(at)mit(dot)edu> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: buffer assertion tripping under repeat pgbench load |
Date: | 2012-12-30 03:07:45 |
Message-ID: | 50DFB001.7010000@2ndQuadrant.com |
Lists: | pgsql-hackers |
On 12/27/12 7:43 AM, Greg Stark wrote:
> If it's always the first buffer then it could conceivably still be
> some other heap allocated object that always lands before
> LocalRefCount. It does seem a bit weird to be storing 1<<30 though --
> there are no 1<<30 constants that we might be storing for example.
It is a strange power of two to be appearing there. I can follow your
reasoning for why this could be a bit flipping error. There's no sign
of that elsewhere though, no other crashes under load. I'm using this
server here because it's worked fine for a while now.
I added printing the buffer number, and they're all over the place:
2012-12-27 06:36:39 EST [26306]: WARNING: refcount of buf 29270
containing base/16384/90124 blockNum=82884, flags=0x127 is 1073741824
should be 0, globally: 0
2012-12-27 02:08:19 EST [21719]: WARNING: refcount of buf 114262
containing base/16384/81932 blockNum=133333, flags=0x106 is 1073741824
should be 0, globally: 0
2012-12-26 20:03:05 EST [15117]: WARNING: refcount of buf 142934
containing base/16384/73740 blockNum=87961, flags=0x127 is 1073741824
should be 0, globally: 0
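For anyone reading the flags values in those warnings, here is a minimal sketch of a decoder. It is not part of the debugging patch: the helper name decode_buf_flags is hypothetical, and the bit assignments are an assumption based on the BM_* definitions in src/include/storage/buf_internals.h as of this era, so check them against the tree being tested.

```c
#include <stddef.h>
#include <string.h>

/*
 * ASSUMPTION: bit assignments copied from the BM_* macros in
 * src/include/storage/buf_internals.h of this era; verify against
 * the source tree actually in use.
 */
struct flag_name
{
    unsigned int bit;
    const char *name;
};

static const struct flag_name buf_flag_names[] = {
    {1u << 0, "BM_DIRTY"},
    {1u << 1, "BM_VALID"},
    {1u << 2, "BM_TAG_VALID"},
    {1u << 3, "BM_IO_IN_PROGRESS"},
    {1u << 4, "BM_IO_ERROR"},
    {1u << 5, "BM_JUST_DIRTIED"},
    {1u << 6, "BM_PIN_COUNT_WAITER"},
    {1u << 7, "BM_CHECKPOINT_NEEDED"},
    {1u << 8, "BM_PERMANENT"},
};

/*
 * Hypothetical helper: writes the names of the set bits into out,
 * separated by '|', truncating if out is too small. Returns out.
 */
static char *
decode_buf_flags(unsigned int flags, char *out, size_t outlen)
{
    size_t i;

    if (outlen == 0)
        return out;
    out[0] = '\0';

    for (i = 0; i < sizeof(buf_flag_names) / sizeof(buf_flag_names[0]); i++)
    {
        if (flags & buf_flag_names[i].bit)
        {
            if (out[0] != '\0')
                strncat(out, "|", outlen - strlen(out) - 1);
            strncat(out, buf_flag_names[i].name, outlen - strlen(out) - 1);
        }
    }
    return out;
}
```

Under that assumed layout, 0x127 decodes to BM_DIRTY|BM_VALID|BM_TAG_VALID|BM_JUST_DIRTIED|BM_PERMANENT and 0x106 to BM_VALID|BM_TAG_VALID|BM_PERMANENT.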
The relation continues to bounce between pgbench_accounts and its
primary key; there's no pattern there that I can see either. To answer
a few other questions: this system does not have ECC RAM. It did survive
many passes of memtest86+ without any problems though, right after the above.
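As a quick check on the bit-flip theory, the refcount reported in every warning above, 1073741824, can be tested for being a single set bit. This is only an illustrative sketch; the helper names is_single_bit and bit_index are hypothetical and do not exist in the PostgreSQL tree.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when exactly one bit is set in v (v is a power of two). */
static bool
is_single_bit(uint32_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Position of the highest set bit, counting from 0; returns 0 for v <= 1. */
static int
bit_index(uint32_t v)
{
    int idx = 0;

    while (v > 1)
    {
        v >>= 1;
        idx++;
    }
    return idx;
}
```

For 1073741824, is_single_bit returns true and bit_index returns 30, which matches the 1<<30 reading. A zero refcount with bit 30 flipped would produce exactly this value, though that alone does not prove a hardware fault.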
I tried duplicating the problem on a similar server. It keeps hanging
due to some Linux software RAID bug before it runs for very long.
Whatever is going on here, it really doesn't want to be discovered.
For reference's sake, the debugging code those latest messages came from
is now:
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dddb6c0..60d3ad3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1697,11 +1697,27 @@ AtEOXact_Buffers(bool isCommit)
     if (assert_enabled)
     {
         int i;
+        int RefCountErrors = 0;
         for (i = 0; i < NBuffers; i++)
         {
-            Assert(PrivateRefCount[i] == 0);
+
+            if (PrivateRefCount[i] != 0)
+            {
+                /*
+                PrintBufferLeakWarning(&BufferDescriptors[i]);
+                */
+                BufferDesc *bufHdr = &BufferDescriptors[i];
+                elog(WARNING,
+                     "refcount of buf %d containing %s blockNum=%u, flags=0x%x is %u should be 0, globally: %u",
+                     i,relpathbackend(bufHdr->tag.rnode, InvalidBackendId, bufHdr->tag.forkNum),
+                     bufHdr->tag.blockNum, bufHdr->flags, PrivateRefCount[i], bufHdr->refcount);
+                RefCountErrors++;
+            }
         }
+        if (RefCountErrors > 0)
+            elog(WARNING, "buffers with non-zero refcount is %d", RefCountErrors);
+        Assert(RefCountErrors == 0);
     }
 #endif
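The point of the change is to report every leaked pin before asserting, rather than stopping at the first one. A standalone sketch of that pattern, outside the backend, might look like the following. The function name count_refcount_leaks is hypothetical, a plain int array stands in for PrivateRefCount, and fprintf stands in for elog; none of this is the actual patch.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for the loop in AtEOXact_Buffers: scan every
 * slot, warn about each non-zero refcount, and return how many were
 * found so the caller can assert on the total once at the end.
 */
static int
count_refcount_leaks(const int *refcounts, int nbuffers)
{
    int i;
    int errors = 0;

    for (i = 0; i < nbuffers; i++)
    {
        if (refcounts[i] != 0)
        {
            fprintf(stderr,
                    "WARNING: refcount of buf %d is %d should be 0\n",
                    i, refcounts[i]);
            errors++;
        }
    }

    if (errors > 0)
        fprintf(stderr,
                "WARNING: buffers with non-zero refcount is %d\n", errors);

    return errors;
}
```

A caller would then do the equivalent of Assert(count_refcount_leaks(refcounts, nbuffers) == 0), so that the log already contains every offending buffer by the time the assertion fires.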