Re: Trimming transaction logs after extended WAL archive failures

From: Steven Schlansker <steven(at)likeness(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org postgresql" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Trimming transaction logs after extended WAL archive failures
Date: 2014-03-26 16:44:05
Message-ID: D0117159-87B3-4CF0-864E-05DD52570B45@likeness.com
Lists: pgsql-general


On Mar 26, 2014, at 9:04 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Tue, Mar 25, 2014 at 6:33 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Tuesday, March 25, 2014, Steven Schlansker <steven(at)likeness(dot)com> wrote:
> Hi everyone,
>
> I have a Postgres 9.3.3 database machine. Due to some intelligent work on the part of someone who shall remain nameless, the WAL archive command included a ‘> /dev/null 2>&1’ which masked archive failures until the disk entirely filled with 400GB of pg_xlog entries.
>
> PostgreSQL itself should be logging failures to the server log, regardless of whether those failures log themselves.
>
>
> I have fixed the archive command and can see WAL segments being shipped off of the server, however the xlog remains at a stable size and is not shrinking. In fact, it’s still growing at a (much slower) rate.
>
> The leading edge of the log files should be archived as soon as they fill up, and recycled/deleted two checkpoints later. The trailing edge should be archived upon checkpoints and then recycled or deleted. I think there is a throttle on how many off the trailing edge are archived each checkpoint. So issue a bunch of "CHECKPOINT;" commands for a while and see if that clears it up.

Indeed, forcing a bunch of CHECKPOINTS started to get things moving again.

>
> Actually my description is rather garbled, mixing up what I saw when wal_keep_segments was lowered, not when recovering from a long-lasting archive failure. Nevertheless, checkpoints are what provoke the removal of excessive WAL files. Are you logging checkpoints? What do they say? Also, what is in pg_xlog/archive_status?
>

I do log checkpoints, but most of them recycle and don’t remove:
Mar 26 16:09:36 prd-db1a postgres[29161]: [221-1] db=,user= LOG: checkpoint complete: wrote 177293 buffers (4.2%); 0 transaction log file(s) added, 0 removed, 56 recycled; write=539.838 s, sync=0.049 s, total=539.909 s; sync files=342, longest=0.015 s, average=0.000 s
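
For anyone who wants to watch these numbers over time, here is a rough sketch that pulls the added / removed / recycled counts out of a "checkpoint complete" line shaped like the one above. The pattern is only fitted to that 9.3-style message, and the names CHECKPOINT_RE and parse_checkpoint are made up for illustration; other versions or log formats may word the message differently.

```python
import re

# Fitted to the "checkpoint complete" message quoted above (PostgreSQL 9.3 wording).
# Assumption: other versions may phrase this message differently.
CHECKPOINT_RE = re.compile(
    r"checkpoint complete: wrote (?P<buffers>\d+) buffers \([\d.]+%\); "
    r"(?P<added>\d+) transaction log file\(s\) added, "
    r"(?P<removed>\d+) removed, (?P<recycled>\d+) recycled"
)


def parse_checkpoint(line):
    """Return the buffer and WAL file counts from one log line, or None if it does not match."""
    match = CHECKPOINT_RE.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


# The log line from this message, trimmed after the counts of interest.
sample = (
    "Mar 26 16:09:36 prd-db1a postgres[29161]: [221-1] db=,user= LOG: "
    "checkpoint complete: wrote 177293 buffers (4.2%); "
    "0 transaction log file(s) added, 0 removed, 56 recycled; write=539.838 s"
)

if __name__ == "__main__":
    print(parse_checkpoint(sample))
```

A "removed" count that stays at 0 while "recycled" is high is what I was seeing: segments are being reused rather than deleted, so pg_xlog does not shrink yet.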

That said, after letting the db run / checkpoint / archive overnight, the xlog did indeed start to slowly shrink. The pace at which it is shrinking is somewhat unsatisfying, but at least we are making progress now!
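
To put a number on the progress, one option is to count the marker files in pg_xlog/archive_status, which Jeff asked about: a segment still waiting to be archived has a .ready file there, and an archived one has a .done file. The sketch below is just one way to do that count; the function name archive_backlog is made up, and the directory you pass in depends on where your data directory lives.

```python
import os


def archive_backlog(status_dir):
    """Count .ready (waiting to be archived) and .done (archived) markers in status_dir."""
    counts = {"ready": 0, "done": 0}
    for name in os.listdir(status_dir):
        # Marker files are named <segment>.ready or <segment>.done.
        _, ext = os.path.splitext(name)
        key = ext.lstrip(".")
        if key in counts:
            counts[key] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python archive_backlog.py /path/to/data/pg_xlog/archive_status
    # (the path is an assumption; substitute your own data directory)
    print(archive_backlog(sys.argv[1]))
```

If the ready count keeps falling between runs, the archiver is catching up, and checkpoints should be able to recycle or remove the segments that are already marked done.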

I guess if I had just been patient I could have saved some mailing list traffic. But patience is hard when your production database system is running at 0% free disk :)

Thanks everyone for the help. If the log continues to shrink, I should be out of the woods now.

Best,
Steven
