Re: pg_xlog-files not deleted on a standby after accidental server crash

From: naveen kumar <mnaveendba2(at)gmail(dot)com>
To: Paul Dunkler <paul(dot)dunkler(at)xyrality(dot)com>
Cc: Venkata Balaji N <nag1010(at)gmail(dot)com>, "pgsql-admin(at)postgresql(dot)org" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: pg_xlog-files not deleted on a standby after accidental server crash
Date: 2015-03-30 14:02:47
Message-ID: CADye41KkZ7WMD6fJa=sKAkM6FY+BVo=FTiAzVi45a3Dq8CLQQw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hi Paul,

May I know, how did you setup replication ? are you using cascade
replication ?

Thanks & Regards,
M Naveen Kuamr,
PostgreSQL Database Administrator,
Mobile Number: +91 7755929449.

On Mon, Mar 30, 2015 at 6:13 PM, Paul Dunkler <paul(dot)dunkler(at)xyrality(dot)com>
wrote:

> >
> > i am currently encountering a weird problem:
> >
> > 3 Days ago we had an accidental server crash on one of our database
> servers (system just restarted).
> > After it was up again, i restarted databases (we are running 4 instances
> of postgres on a 512GB / 64 Processor server). All databases came up fine
> again.
> > Afterwards i restarted the slave postgres instances (running on another
> physical server) and they could recover correctly and went ahead with
> receiving WAL via streaming replication. All fine so far.
> >
> > Then i realized that only on 1 of the 4 slave instances, the WAL-File
> count under data/pg_xlog was raising and raising, seems like WAL files are
> not reused / deleted by postgres.
> >
> > This is weird. Do you see this behavior on master too ?
>
> No - thats the strange thing. I don't have the problem on the master side.
> No long running transactions or something like that (we even have a script
> which cares about these).
>
> > How many files do you see in pg_xlog location at any point-of-time ?
>
> The number is just growing and growing and growing... We let it grow for
> like 4 days and then we had like 3000 files in pg_xlog on the slave...
>
> > The number-of-files are much bigger than checkpoint_segments configured
> ??
>
> checkpoint_segments is set to 40...
>
> > Afterwards i tried different things:
> > 1. Restarted the slave multiple times without any effect on the xlog
> behaviour
> > 2. Setting up the slave again from a fresh base backup of the master
> instance
> > 3. Checking file permissions, even tried manually deleting a WAL-File
> under data/pg_xlog (which worked)
> >
> > Now i'm running out of ideas why these files are not reused/deleted
> anymore.
> > I don't see any errors in the logfiles (not on slave, not on master).
> >
> > Slave is only showing normal "restartpoint" logs:
> >> Mar 24 10:51:10 ct-db1b ct-db1-1[65432]: [3-1] db=, user=, host= LOG:
> restartpoint starting: time
> >> Mar 24 10:53:42 ct-db1b ct-db1-1[65432]: [4-1] db=, user=, host= LOG:
> restartpoint complete: wrote 21264 buffers (0.4%); 0 transaction log
> file(s) added, 0 removed, 16 recycled; write=149.345 s, sync=2.217 s,
> total=151.682 s; sync files=1156, longest=0.213 s, average=0.001 s
> >> Mar 24 10:53:42 ct-db1b ct-db1-1[65432]: [5-1] db=, user=, host= LOG:
> recovery restart point at C3/E9AD3258
> >> Mar 24 10:53:42 ct-db1b ct-db1-1[65432]: [5-2] db=, user=, host=
> DETAIL: last completed transaction was at log time 2015-03-24
> 09:53:42.247769+00
> >> Mar 24 10:56:10 ct-db1b ct-db1-1[65432]: [6-1] db=, user=, host= LOG:
> restartpoint starting: time
> >> Mar 24 10:58:41 ct-db1b ct-db1-1[65432]: [7-1] db=, user=, host= LOG:
> restartpoint complete: wrote 6402 buffers (0.1%); 0 transaction log file(s)
> added, 0 removed, 3 recycled; write=149.516 s, sync=1.101 s, total=150.757
> s; sync files=729, longest=0.160 s, average=0.001 s
> >> Mar 24 10:58:41 ct-db1b ct-db1-1[65432]: [8-1] db=, user=, host= LOG:
> recovery restart point at C3/EDA836D8
> >> Mar 24 10:58:41 ct-db1b ct-db1-1[65432]: [8-2] db=, user=, host=
> DETAIL: last completed transaction was at log time 2015-03-24
> 09:58:40.611415+00
> >> Mar 24 11:01:10 ct-db1b ct-db1-1[65432]: [9-1] db=, user=, host= LOG:
> restartpoint starting: time
> >> Mar 24 11:03:41 ct-db1b ct-db1-1[65432]: [10-1] db=, user=, host= LOG:
> restartpoint complete: wrote 11746 buffers (0.2%); 0 transaction log
> file(s) added, 0 removed, 5 recycled; write=149.652 s, sync=0.936 s,
> total=150.725 s; sync files=672, longest=0.147 s, average=0.001 s
> >> Mar 24 11:03:41 ct-db1b ct-db1-1[65432]: [11-1] db=, user=, host= LOG:
> recovery restart point at C3/F046DA18
> >> Mar 24 11:03:41 ct-db1b ct-db1-1[65432]: [11-2] db=, user=, host=
> DETAIL: last completed transaction was at log time 2015-03-24
> 10:03:40.914148+00
> >> Mar 24 11:06:10 ct-db1b ct-db1-1[65432]: [12-1] db=, user=, host= LOG:
> restartpoint starting: time
> >> Mar 24 11:08:41 ct-db1b ct-db1-1[65432]: [13-1] db=, user=, host= LOG:
> restartpoint complete: wrote 11155 buffers (0.2%); 0 transaction log
> file(s) added, 0 removed, 4 recycled; write=149.780 s, sync=1.184 s,
> total=151.080 s; sync files=758, longest=0.184 s, average=0.001 s
> >> Mar 24 11:08:41 ct-db1b ct-db1-1[65432]: [14-1] db=, user=, host= LOG:
> recovery restart point at C3/F3C51B40
> >> Mar 24 11:08:41 ct-db1b ct-db1-1[65432]: [14-2] db=, user=, host=
> DETAIL: last completed transaction was at log time 2015-03-24
> 10:08:41.087379+00
> >> Mar 24 11:11:10 ct-db1b ct-db1-1[65432]: [15-1] db=, user=, host= LOG:
> restartpoint starting: time
> >> Mar 24 11:13:41 ct-db1b ct-db1-1[65432]: [16-1] db=, user=, host= LOG:
> restartpoint complete: wrote 12602 buffers (0.2%); 0 transaction log
> file(s) added, 0 removed, 2 recycled; write=149.632 s, sync=1.309 s,
> total=151.067 s; sync files=728, longest=0.203 s, average=0.001 s
> >> Mar 24 11:13:41 ct-db1b ct-db1-1[65432]: [17-1] db=, user=, host= LOG:
> recovery restart point at C3/F93AC8F0
> >> Mar 24 11:13:41 ct-db1b ct-db1-1[65432]: [17-2] db=, user=, host=
> DETAIL: last completed transaction was at log time 2015-03-24
> 10:13:41.168975+00
> >> Mar 24 11:16:10 ct-db1b ct-db1-1[65432]: [18-1] db=, user=, host= LOG:
> restartpoint starting: time
> >> Mar 24 11:18:41 ct-db1b ct-db1-1[65432]: [19-1] db=, user=, host= LOG:
> restartpoint complete: wrote 10234 buffers (0.2%); 0 transaction log
> file(s) added, 0 removed, 0 recycled; write=149.411 s, sync=1.175 s,
> total=150.703 s; sync files=821, longest=0.142 s, average=0.001 s
> >> Mar 24 11:18:41 ct-db1b ct-db1-1[65432]: [20-1] db=, user=, host= LOG:
> recovery restart point at C3/FDD61320
> >> Mar 24 11:18:41 ct-db1b ct-db1-1[65432]: [20-2] db=, user=, host=
> DETAIL: last completed transaction was at log time 2015-03-24
> 10:18:41.052593+00
> >> Mar 24 11:21:10 ct-db1b ct-db1-1[65432]: [21-1] db=, user=, host= LOG:
> restartpoint starting: time
> >> Mar 24 11:23:40 ct-db1b ct-db1-1[65432]: [22-1] db=, user=, host= LOG:
> restartpoint complete: wrote 9105 buffers (0.2%); 0 transaction log file(s)
> added, 0 removed, 0 recycled; write=149.598 s, sync=0.842 s, total=150.579
> s; sync files=687, longest=0.061 s, average=0.001 s
> >> Mar 24 11:23:40 ct-db1b ct-db1-1[65432]: [23-1] db=, user=, host= LOG:
> recovery restart point at C4/290A078
> >> Mar 24 11:23:40 ct-db1b ct-db1-1[65432]: [23-2] db=, user=, host=
> DETAIL: last completed transaction was at log time 2015-03-24
> 10:23:40.446308+00
> > Looking at your logs, it does make sense why the WAL files are not
> getting deleted. If you observe you logfile - the checkpoint-log-message
> always says "0 removed and # recycled", this precisely means, files are
> getting reused and are not getting deleted. I am curious to know how did
> you get to a conclusion that WAL files are not getting reused.
>
> Because i knew the behaviour from bevore and i see the behaviour from the
> other instances... Can't be that the number is just growing and growing and
> growing...
>
> >
> > If you are concerned about number of files in pg_xlog, then it must be
> due to huge checkpoint_segments or long checkpoint_timeout or you may have
> wal_keep_segments configured (i am completely not sure about this parameter
> effect on standby).
>
> as i said, checkpoint_segments is 40, checkpoint_timeout is 5 min and
> wal_keep_segments is set to 32
>
> > Ideally, CHECKPOINT process is responsible for deleting the WAL files
> from pg_xlog location periodically and it takes in to account
> WAL_KEEP_SEGMENTS. If you see any big transactions generating huge WAL
> files, then you must notice the same behavior on master as well.
>
> For now we set the archive_cleanup command on the slave for deleting the
> xlogs under data/pg_xlog - But it shouldn't be like that - This should
> happen automatically...
>
> Still out of ideas...
>
>
> --
> Mit freundlichen Grüßen / Kind regards
>
> Paul Dunkler
>
>

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Paul Dunkler 2015-03-30 15:58:05 Re: pg_xlog-files not deleted on a standby after accidental server crash
Previous Message Paul Dunkler 2015-03-30 12:43:02 Re: pg_xlog-files not deleted on a standby after accidental server crash