Re: Auto-vacuum is not running in 9.1.12

From: Prakash Itnal <prakash074(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, rasna(dot)t(at)nokia(dot)com, sandhya(dot)k_s(at)nokia(dot)com
Subject: Re: Auto-vacuum is not running in 9.1.12
Date: 2015-06-20 13:32:25
Message-ID: CAHC5u79X9z5v3fVDHeTwaAm_qBKx_fRvWKG7miw7yBiVhGFTxw@mail.gmail.com
Lists: pgsql-hackers

Hi,

Sorry for the late response. The current patch fixes only scenario-1
listed below; it does not address scenario-2. We also need a fix in
unix_latch.c, where the remaining sleep time is re-evaluated when the latch
is woken by other events (or result=0). Here too, the latch can go into a
long sleep if the system time shifts into the past.

*Scenario-1:* current_time (2015) -> changed_to_past (1995) ->
stays-here-for-half-day -> corrected to current_time (2015)
*Scenario-2:* current_time (2015) -> changed_to_future (2020) ->
stays-here-for-half-day -> corrected to current_time (2015)

*Results: *
Scenario-1: Auto-vacuuming is not done from the time the system time is
changed to 1995 until it is corrected to the current time; in this case,
half a day.
Scenario-2: Auto-vacuuming keeps running while the system time is shifted to
the future. However, after the time is corrected back to the current time
(2020 -> 2015), auto-vacuuming goes into a 5-year sleep. Although the current
patch fixes waking up from the sleep, it still does not allow an auto-vacuum
worker to be launched, because the dblist still holds the previously set
time, i.e. 2020.

*Proposed Fixes:*
*autovacuum.c:* I call rebuild_database_list if a time shift is detected. A
time shift is detected if the evaluated sleep time is zero or greater than
autovacuum_naptime. Currently the list is rebuilt only if the time shifts to
the future. I added a check to rebuild it if the sleep time is greater than
autovacuum_naptime. Secondly, I included the patch from Alvaro and changed
the default 300-second value to autovacuum_naptime. This avoids multiple
wakeups if autovacuum_naptime is set to more than 300 seconds.
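
For illustration only, the rule described above could be sketched roughly as
follows. This is not the code from the attached patch: the function names
(raw_sleep_ms, time_shift_suspected, bounded_sleep_ms) are hypothetical, and
times are plain millisecond counts rather than the server's timestamp type.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Time until the next scheduled worker, in milliseconds. After a clock
 * jump this can be negative or far larger than the naptime.
 */
int64_t
raw_sleep_ms(int64_t next_worker_ms, int64_t now_ms)
{
    return next_worker_ms - now_ms;
}

/*
 * Heuristic from the description above: a sleep time of zero (or less)
 * or one longer than the naptime is taken as a sign that the schedule
 * is stale and the database list should be rebuilt.
 */
bool
time_shift_suspected(int64_t sleep_ms, int64_t naptime_ms)
{
    return sleep_ms <= 0 || sleep_ms > naptime_ms;
}

/* Never sleep longer than one naptime, and never a negative amount. */
int64_t
bounded_sleep_ms(int64_t sleep_ms, int64_t naptime_ms)
{
    if (sleep_ms < 0)
        return 0;
    if (sleep_ms > naptime_ms)
        return naptime_ms;
    return sleep_ms;
}
```

In scenario-2 the stale entry lies about five years ahead, so the raw sleep
time is far above the naptime: the check reports a suspected shift and the
sleep is capped at one naptime instead of five years.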

*unix_latch.c:* The current implementation evaluates the remaining sleep time
using "cur_timeout = timeout - (cur_time - start_time)". If the time is
shifted back to the past, cur_timeout evaluates to a very long time (e.g.
with start_time=2015 and cur_time=1995, cur_timeout = timeout - (-20
years) = timeout + 20 years). To avoid this wrong calculation I added a
check that treats this case as a timeout.
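
A minimal sketch of that check, assuming millisecond timestamps, is shown
here. It is not the actual unix_latch.c code or the attached patch;
remaining_timeout_ms is a hypothetical helper written only to show the
arithmetic.

```c
#include <stdint.h>

/*
 * Hypothetical helper: how many milliseconds of a timeout remain, given
 * the time the wait started and the current time. Returns 0 (treat as
 * timed out) when the clock appears to have moved backwards or the
 * timeout has already been used up.
 */
int64_t
remaining_timeout_ms(int64_t timeout_ms, int64_t start_ms, int64_t now_ms)
{
    int64_t elapsed_ms = now_ms - start_ms;

    /* Clock jumped into the past: the elapsed time is meaningless. */
    if (elapsed_ms < 0)
        return 0;

    /* The full timeout has already passed. */
    if (elapsed_ms >= timeout_ms)
        return 0;

    return timeout_ms - elapsed_ms;
}
```

Without the first check, a negative elapsed time would be subtracted from
the timeout and lengthen the wait, which is the "timeout + 20 years" case
above.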

With the fixes mentioned above, auto-vacuuming will be robust enough to
handle any system time changes. We tested both scenarios in our setup and
they seem to work fine. I hope these are valid fixes and that they do not
affect any other flows.

Please review and share your comments/suggestions.

PS: In our product the database is used in update-heavy mode with limited
disk space. So we need to be robust against such time changes to avoid any
system failures due to a full disk.

On Fri, Jun 19, 2015 at 10:28 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On 2015-06-17 18:10:42 -0300, Alvaro Herrera wrote:
> > Yeah, the case is pretty weird and I'm not really sure that the server
> > ought to be expected to behave. But if this is actually the only part
> > of the server that misbehaves because of sudden gigantic time jumps, I
> > think it's fair to patch it. Here's a proposed patch.
>
> We probably should go through the server and look at the various sleeps
> and make sure they all have an upper limit. I doubt this is the only
> location without one.
>
> Greetings,
>
> Andres Freund
>

--
Cheers,
Prakash

Attachment Content-Type Size
time_shift_fixes_in_autovacuum.patch application/octet-stream 3.1 KB
