Re: is it cool to restart servers as preventive maintenance?

From: Kiriakos Georgiou <kg(dot)postgresql(at)olympiakos(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: is it cool to restart servers as preventive maintenance?
Date: 2016-02-10 21:47:15
Message-ID: 27A8518A-47E5-4BD0-9EF0-6AFEBE4A4DB3@olympiakos.com
Lists: pgsql-admin


> On Feb 10, 2016, at 3:18 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Kiriakos Georgiou <kg(dot)postgresql(at)olympiakos(dot)com> writes:
>> In the last 12 months I have noticed 3-4 instances of database flakiness that is cured by restarting.
>> I’ve been using PostgreSQL since 2007 and I haven’t seen such issues requiring a reboot, but on my current project we do some rather heavy-duty PostGIS analysis that apparently stresses the system enough to occasionally cause this (that’s my theory, anyway). I’m beginning to seriously consider restarting servers on a monthly basis.
>
> What sort of "database flakiness"?
>
> It's possible you're encountering some kind of bug (memory leak?) in
> PostGIS, but that would be a bug you ought to get them to fix, not a
> reason why periodic restarts are a good idea.
>
> regards, tom lane
>

Flaky = the database appears to be running OK (I can run queries via psql) but our app is down for no apparent reason. Restarting the app servers multiple times did not help, so even though the database seemed to be responding fine, I decided to restart it too. That was a good move: our app worked fine after the database restart.

There is more to it: about 8 hours earlier, our warm standby PostgreSQL server filled up the volume where it writes its server logs by repeating the following two lines millions of times:

WARNING: out of shared memory
CONTEXT: xlog redo AccessExclusive locks: xid 2002212 db 16384 rel 1079879

I had seen this on our primary about a year ago, and I kept doubling max_locks_per_transaction all the way to 1024, at which point the problem stopped recurring (on the primary).
I still occasionally (once every 3-4 months) see the “out of shared memory” message on the standby, even though max_locks_per_transaction has the same value of 1024 as on the primary. When this happens I have to rebuild the standby from the primary via pg_basebackup. And when “out of shared memory” hits the standby, it’s a coin toss whether the primary will misbehave as well. This time it did, and the restart fixed it.
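
For anyone hitting the same thing: the shared lock table holds roughly max_locks_per_transaction * (max_connections + max_prepared_transactions) object locks, and on a hot standby it also has to track every AccessExclusive lock replayed from the primary’s WAL (that’s what the CONTEXT line above is about). A minimal sketch of what to check and how to raise the limit (the 2048 below is purely illustrative, not a recommendation):

    -- capacity inputs; run on both primary and standby
    SHOW max_locks_per_transaction;
    SHOW max_connections;
    SHOW max_prepared_transactions;

    -- watch how full the lock table is getting
    SELECT count(*) FROM pg_locks;
    SELECT count(*) FROM pg_locks WHERE mode = 'AccessExclusiveLock';

    -- raise the limit; takes effect only after a server restart
    -- (edit postgresql.conf, or on 9.4+ use ALTER SYSTEM)
    ALTER SYSTEM SET max_locks_per_transaction = 2048;  -- illustrative value

Note that a standby sizes its lock table from its own settings, and its max_locks_per_transaction must be at least as large as the primary’s.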

Writing this email made me realize the likely cause of our problem: it’s the “out of shared memory” issue. We do have a plpgsql function that calls 30+ other plpgsql functions, some of which create temp tables, and some of the calls are within loops. So depending on data inputs we can get hundreds of locks on temp tables within a single transaction (a stripped-down illustration is below). The odd thing is that at max_locks_per_transaction = 1024 we no longer get any “out of shared memory” on the primary, but we do on the standby. Any ideas about that? Should I increase max_locks_per_transaction yet again?
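
The stripped-down illustration (a hypothetical sketch, not our actual functions): CREATE TEMP TABLE takes an AccessExclusive lock on the new relation, and that lock is held until the transaction ends even if the table is dropped right away, so every loop iteration leaves entries behind in the lock table:

    CREATE OR REPLACE FUNCTION analysis_step(i int) RETURNS void AS $$
    BEGIN
        -- hypothetical helper: creates and drops a scratch temp table;
        -- the AccessExclusive lock it acquires stays until commit
        EXECUTE format('CREATE TEMP TABLE scratch_%s (id int)', i);
        -- ... work with the temp table ...
        EXECUTE format('DROP TABLE scratch_%s', i);
    END;
    $$ LANGUAGE plpgsql;

    CREATE OR REPLACE FUNCTION run_analysis(n int) RETURNS void AS $$
    BEGIN
        FOR i IN 1..n LOOP
            PERFORM analysis_step(i);  -- each iteration adds lock entries
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

    -- SELECT run_analysis(500);  -- one transaction, hundreds of locks held

With big enough inputs, that blows straight past whatever limit we set.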

thanks,
Kiriakos Georgiou
