buffer assertion tripping under repeat pgbench load

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: buffer assertion tripping under repeat pgbench load
Date: 2012-12-23 07:36:42
Message-ID: 50D6B48A.1070905@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I'm testing a checkout from a few days ago and trying to complete a day
long pgbench stress test, with assertions and debugging on. I want to
make sure the base code works as expected before moving on to testing
checksums. It's crashing before finishing though. Here's a sample:

2012-12-20 20:36:11 EST [26613]: LOG: checkpoint complete: wrote 32
buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled;
write=3.108 s, sync=0.978 s, total=4.187 s; sync files=7, longest=0.832
s, average=0.139 s
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c",
Line: 1703)
2012-12-20 20:36:44 EST [26611]: LOG: server process (PID 4834) was
terminated by signal 6: Aborted
2012-12-20 20:36:44 EST [26611]: DETAIL: Failed process was running: END;

Running the same test set again gave the same crash, so this looks
repeatable. It's not trivial to trigger given the average runtime to
crash I'm seeing. Second sample:

2012-12-22 21:27:17 EST [6006]: LOG: checkpoint complete: wrote 34
buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled;
write=3.310 s, sync=2.906 s, total=6.424 s; sync files=7, longest=2.648
s, average=0.415 s
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c",
Line: 1703)
2012-12-22 21:27:23 EST [26611]: LOG: server process (PID 12143) was
terminated by signal 6: Aborted
2012-12-22 21:27:23 EST [26611]: DETAIL: Failed process was running: END;

The important part:

TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c",
Line: 1703)

That's the check done by "AtEOXact_Buffers - clean up at end of
transaction". It fired while executing the "END;"" statement at the end
of the standard pgbench test. The test looks for loose things that
weren't pinned/unpinned correctly by the transaction. I'm not sure if
getting a stack trace at that point will be useful. The ref count
mistake could easily have been made by an earlier statement in the
transaction block, right?

This is on a server where shared_buffers=2GB, checkpoint_segments=64.
Test runs were automated with pgbench-tools, using the following
configuration:

MAX_WORKERS="4"
SCRIPT="tpc-b.sql"
SCALES="500 1000"
SETCLIENTS="4 8 16 32 64 96"
SETTIMES=3
RUNTIME=600

All of the crashes so far have been at scale=500, and I haven't seen
those finish yet. It failed 7 hours in the first time, then 4 hours in.

For code reference, the last commit in the source code tree I built
against was:

af275a12dfeecd621ed9899a9382f26a68310263
Thu Dec 20 14:23:31 2012 +0200

Looking at recent activity, the only buffer pin related changes I saw
was this one:

commit 62656617dbe49cca140f3da588a5e311b3fc35ea
Date: Mon Dec 3 18:53:31 2012 +0000
Avoid holding vmbuffer pin after VACUUM.

This is my first test like this against 9.3 development though, so the
cause could be an earlier commit. I'm just starting with the most
recent work as the first suspect. Next I think I'll try autovacuum=off
and see if the crash goes away. Other ideas are welcome.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2012-12-23 09:15:33 Re: buffer assertion tripping under repeat pgbench load
Previous Message Andres Freund 2012-12-23 01:11:05 Re: pgcrypto seeding problem when ssl=on