Re: BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues

From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Rafael Martinez Guerrero <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues
Date: 2013-02-24 21:55:44
Message-ID: D2DE538F-32D2-4F70-9260-1FDAAE0F219D@pgexperts.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs


On Feb 24, 2013, at 1:05 PM, Jeff Frost <jeff(at)pgexperts(dot)com> wrote:

>
> On Feb 24, 2013, at 7:16 AM, Rafael Martinez Guerrero <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no> wrote:
>
>> We reported this back in 2011, but we did not get to any conclusion:
>> http://www.postgresql.org/message-id/4DE89072.7070305@usit.uio.no
>>
>> In our case, we had this problem when creating a GIN index.
>>
>> I think the problem has something to do with checkpoints, I think the
>> number of WAL files will grow beyond the limit defined because the
>> system can not be finish with checkpoints. A good start to try to
>> explain what it is happening would be to try to identify or describe the
>> situations where checkpoints can take very long to complete or fail
>> altogether.
>>
>> We are interested in this subject because we have had this problem a few
>> times. But we have not been able to identify the reason that triggers this.
>
> Rafael, did you do a followup post on pgsql-performance, or did the thread die at the end of the one you post above?

This is how the log_checkpoint output looks during the run:

2013-02-24 21:06:31.156 UTC,,,1624,,51282598.658,114,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint starting: immediate force wait",,,,,,,,,""
2013-02-24 21:06:31.216 UTC,,,1624,,51282598.658,115,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 108 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 4 recycled; write=0.054 s, sync=0.002 s, total=0.059 s; sync files=10, longest=0.000 s, average=0.000 s",,,,,,,,,""

So, this ^^^^^ is the manual checkpoint prior to the pgbench run and everything else is part of the pgbench run.

2013-02-24 21:06:45.227 UTC,,,1624,,51282598.658,116,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoints are occurring too frequently (14 seconds apart)",,"Consider increasing the configuration parameter ""checkpoint_segments"".",,,,,,,""
2013-02-24 21:06:45.227 UTC,,,1624,,51282598.658,117,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2013-02-24 21:08:35.239 UTC,,,1624,,51282598.658,118,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 266935 buffers (12.7%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=10.889 s, sync=99.079 s, total=110.011 s; sync files=29, longest=12.888 s, average=3.416 s",,,,,,,,,""
2013-02-24 21:08:35.241 UTC,,,1624,,51282598.658,119,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2013-02-24 21:11:19.506 UTC,,,1624,,51282598.658,120,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 801517 buffers (38.2%); 0 transaction log file(s) added, 0 removed, 128 recycled; write=15.484 s, sync=148.739 s, total=164.266 s; sync files=32, longest=11.773 s, average=4.648 s",,,,,,,,,""
2013-02-24 21:11:19.506 UTC,,,1624,,51282598.658,121,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2013-02-24 21:14:20.486 UTC,,,1624,,51282598.658,122,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 1009466 buffers (48.1%); 0 transaction log file(s) added, 140 removed, 257 recycled; write=13.690 s, sync=167.245 s, total=180.980 s; sync files=33, longest=14.442 s, average=5.067 s",,,,,,,,,""
2013-02-24 21:14:20.487 UTC,,,1624,,51282598.658,123,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2013-02-24 21:17:31.507 UTC,,,1624,,51282598.658,124,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 1055059 buffers (50.3%); 0 transaction log file(s) added, 259 removed, 257 recycled; write=22.731 s, sync=168.239 s, total=191.020 s; sync files=32, longest=13.650 s, average=5.257 s",,,,,,,,,""
2013-02-24 21:17:31.507 UTC,,,1624,,51282598.658,125,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2013-02-24 21:23:36.201 UTC,,,1624,,51282598.658,126,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 814767 buffers (38.9%); 0 transaction log file(s) added, 546 removed, 0 recycled; write=269.553 s, sync=95.042 s, total=364.693 s; sync files=31, longest=11.211 s, average=3.065 s",,,,,,,,,""

They're all number of xlog based and the first one doesn't remove any, that's not surprising, but I'm surprised that the second one doesn't remove any. It does recycle quite a few. I guess the issue is that it can generate WAL files way faster than it can checkpoint and cleanup. I think I need to write a script to output the number of WAL files each minute so we can better correlate what's going on.

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Rafael Martinez Guerrero 2013-02-24 22:51:00 Re: BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues
Previous Message Jeff Frost 2013-02-24 21:05:48 Re: BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues