Re: Problems with pg_locks explosion

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Armand du Plessis <adp(at)bank(dot)io>
Cc: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Problems with pg_locks explosion
Date: 2013-04-02 00:21:16
Message-ID: CAMkU=1x5Ff1DAYYknDpOyJhDrAUxvWxg+c73A42t3S_GDGpRng@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Mon, Apr 1, 2013 at 3:35 PM, Armand du Plessis <adp(at)bank(dot)io> wrote:

> [Apologies, I first sent this to the incorrect list, postgres-admin, in
> the event you receive it twice]
>
> Hi there,
>
> I'm hoping someone on the list can shed some light on an issue I'm having
> with our Postgresql cluster. I'm literally tearing out my hair and don't
> have a deep enough understanding of Postgres to find the problem.
>
> What's happening is I had severe disk/io issues on our original Postgres
> cluster (9.0.8)
>
and switched to a new instance with a RAID-0 volume array.
>

What was the old instance IO? Did you do IO benchmarking on both?

> The machine's CPU usage would hover around 30% and our database would run
> lightning fast with pg_locks hovering between 100-200.
>
> Within a few seconds something would trigger a massive increase in
> pg_locks so that it suddenly shoots up to 4000-8000. At this point
> everything dies. Queries that usually take a few milliseconds takes minutes
> and everything is unresponsive until I restart postgres.
>

I think that pg_locks is pretty much a red herring. All it means is that
you have a lot more active connections than you used to. All active
connections are going to hold various locks, while most idle connections
(other than 'idle in transaction') connections will not hold any.

Although I doubt it will solve this particular problem, you should probably
use a connection pooler.

> shared_buffers = 32GB
>

That seems very high. There are reports that using >8 GB leads to
precisely the type of problem you are seeing (checkpoint associated
freezes). Although I've never seen those reports when fsync=off.

I thought you might be suffering from the problem solved in release 9.1 by
item "Merge duplicate fsync requests (Robert Haas, Greg Smith)", but then I
realized that with fsync=off it could not be that.

>
> max_connections = 800
>

That also is very high.

> The problems seem to overlap with checkpoints.
>
> 2013-04-01 21:31:35.592 UTC,,,26877,,5159fa5f.68fd,1,,2013-04-01 21:21:35
> UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
> 2013-04-01 21:40:35.033 UTC,,,26877,,5159fa5f.68fd,2,,2013-04-01 21:21:35
> UTC,,0,LOG,00000,"checkpoint complete: wrote 100635 buffers (2.4%); 0
> transaction log file(s) added, 0 removed, 1 recycled; write=539.439 s,
> sync=0.000 s, total=539.441 s; sync files=0, longest=0.000 s, average=0.000
> s",,,,,,,,,""
> 2013-04-01 21:41:35.093 UTC,,,26877,,5159fa5f.68fd,3,,2013-04-01 21:21:35
> UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
>

I think you changed checkpoint_timout from default (5 min) to 10 minutes,
without telling us. Anyway, this is where it would be nice to know how
much of the 539.439 s in the write phase was spent blocking on writes, and
how much was spent napping. But that info is not collected by pgsql.

Your top output looked for it was a time at which there were no problems,
and it didn't include the top processes, so it wasn't very informative.

If you could upgrade to 9.2 and capture some data with track_io_timing,
that could be useful.

Cheers,

Jeff

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Bruce Momjian 2013-04-02 00:27:06 Re: Postgres upgrade, security release, where?
Previous Message Mark Kirkwood 2013-04-02 00:11:15 Re: Problems with pg_locks explosion