SSI slows down over time

From:	Ryan Johnson <ryan(dot)johnson(at)cs(dot)utoronto(dot)ca>
To:	pgsql-performance(at)postgresql(dot)org
Subject:	SSI slows down over time
Date:	2014-04-06 02:25:13
Message-ID:	5340BB09.5010101@cs.utoronto.ca
Lists:	pgsql-performance

Hi all,

Disclaimer: this question probably belongs on the hackers list, but the
instructions say you have to try somewhere else first... toss-up between
this list and a bug report; list seemed more appropriate as a starting
point. Happy to file a bug if that's more appropriate, though.

This is with pgsql-9.3.4, x86_64-linux, home-built with `./configure
--prefix=...' and gcc-4.7.
TPC-C courtesy of oltpbenchmark.com. 12WH TPC-C, 24 clients.

I get a strange behavior across repeated runs: each 100-second run is a
bit slower than the one preceding it, when run with SSI (SERIALIZABLE).
Switching to SI (REPEATABLE_READ) removes the problem, so it's
apparently not due to the database growing. The database is completely
shut down (pg_ctl stop) between runs, but the data lives in tmpfs, so
there's no I/O problem here. 64GB RAM, so no paging, either.

Note that this slowdown is in addition to the 30% performance hit from using
SSI on my 24-core machine. I understand that the latter is a known
bottleneck; my question is why the bottleneck should get worse over time:

With SI, I get ~4.4ktps, consistently.
With SSI, I get 3.9, 3.8, 3.4, 3.3, 3.1, 2.9, ... ktps on successive runs.
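
To make the trend in those figures easier to read, here is a small sketch that tabulates each SSI run against the SI baseline and against the first SSI run. It uses only the numbers quoted above; the names `SI_KTPS`, `SSI_KTPS`, `pct_drop` and `summarize` are made up for illustration.

```python
# Throughput figures as reported in this message, in thousands of tps (ktps).
SI_KTPS = 4.4
SSI_KTPS = [3.9, 3.8, 3.4, 3.3, 3.1, 2.9]


def pct_drop(baseline, value):
    """Percentage by which value falls short of baseline."""
    return 100.0 * (baseline - value) / baseline


def summarize(si, ssi_runs):
    """Return one (run, ktps, % below SI, % below first SSI run) row per run."""
    first = ssi_runs[0]
    return [
        (i, ktps, pct_drop(si, ktps), pct_drop(first, ktps))
        for i, ktps in enumerate(ssi_runs, start=1)
    ]


if __name__ == "__main__":
    print("run  ktps  below_SI  below_run1")
    for run, ktps, vs_si, vs_first in summarize(SI_KTPS, SSI_KTPS):
        print(f"{run:>3}  {ktps:>4.1f}  {vs_si:>7.1f}%  {vs_first:>9.1f}%")
```

The last column isolates the run-over-run decay from the fixed cost of SSI itself, which is the part in question here.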

So the question: what should I look for to diagnose/triage this problem?
I'm willing to do some legwork, but have no idea where to go next.

I've tried linux perf, but all it says is that lots of time is going to
LWLock (but callgraph tracing doesn't work in my not-bleeding-edge
kernel). Looking through the logs, the abort rates due to SSI aren't
changing in any obvious way. I've been hacking on SSI for over a month
now as part of a research project, and am fairly familiar with
predicate.c, but I don't see any obvious reason this behavior should
arise (in particular, SLRU storage seems to be re-initialized every time
the postmaster restarts, so there shouldn't be any particular memory
effect due to SIREAD locks). I'm also familiar with both Cahill's and
Ports/Grittner's published descriptions of SSI, but again, nothing
obvious jumps out.
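
One check that might help test the "no memory effect from SIREAD locks" assumption, offered as a sketch and not a known diagnosis: sample the number of `SIReadLock` entries in the `pg_locks` view at the same point in each run and compare across runs. The helper name `count_siread_locks` is hypothetical, and it assumes a DB-API 2.0 cursor (for example from psycopg2, which is an assumption and not something used in the setup above).

```python
def count_siread_locks(cursor):
    """Return {locktype: count} for SIReadLock entries currently in pg_locks.

    `cursor` is any DB-API 2.0 cursor connected to the server under test;
    which driver provides it is an assumption left to the caller.
    """
    cursor.execute(
        "SELECT locktype, count(*) FROM pg_locks "
        "WHERE mode = 'SIReadLock' GROUP BY locktype"
    )
    # Each row is (locktype, count); locktype is e.g. relation, page or tuple.
    return {locktype: int(n) for locktype, n in cursor.fetchall()}
```

If the per-locktype counts at a fixed offset into each run stay flat while throughput falls, predicate-lock volume is less likely to be the cause; if they grow from run to run, that would be worth following up.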

In my experience this sort of behavior indicates a type of bug where
fixing it would have a large impact on performance (because the early
"damage" is done so quickly that even the very first run doesn't live up
to its true potential).

$ cat pgsql.conf
shared_buffers = 8GB
synchronous_commit = off
checkpoint_segments = 64
max_pred_locks_per_transaction = 2000
default_statistics_target = 100
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
effective_cache_size = 40GB
work_mem = 1920MB
wal_buffers = 16MB

Thanks,
Ryan

Responses

Re: SSI slows down over time at 2014-04-06 08:30:46 from Heikki Linnakangas
Re: SSI slows down over time at 2014-04-06 14:55:37 from Tom Lane
Re: SSI slows down over time at 2014-04-07 14:38:52 from Ryan Johnson
