Quick Links

Re: max_standby_delay considered harmful

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Florian Pflug <fgp(at)phlo(dot)org>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject:	Re: max_standby_delay considered harmful
Date:	2010-05-12 19:10:28
Message-ID:	4BEAFD24.2070707@kaltenbrunner.cc
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 05/12/2010 05:28 PM, Simon Riggs wrote:
> On Wed, 2010-05-12 at 14:18 +0100, Simon Riggs wrote:
>> On Wed, 2010-05-12 at 08:52 -0400, Robert Haas wrote:
>>> On Wed, May 12, 2010 at 7:26 AM, Simon Riggs<simon(at)2ndquadrant(dot)com> wrote:
>>>> On Wed, 2010-05-12 at 07:10 -0400, Robert Haas wrote:
>>>>
>>>>> I'm not sure what to make of this. Sometimes not shutting down
>>>>> doesn't sound like a feature to me.
>>>>
>>>> It acts exactly the same in recovery as in normal running. It is not a
>>>> special feature of recovery at all, bug or otherwise.
>>>
>>> Simon, that doesn't make any sense. We are talking about a backend
>>> getting stuck forever on an exclusive lock that is held by the startup
>>> process and which will never be released (for example, because the
>>> master has shut down and no more WAL can be obtained for replay). The
>>> startup process does not hold locks in normal operation.
>>
>> When I test it, startup process holding a lock does not prevent shutdown
>> of a standby.
>>
>> I'd be happy to see your test case showing a bug exists and that the
>> behaviour differs from normal running.
>
> Let me put this differently: I accept that Stefan has reported a
> problem. Neither Tom nor myself can reproduce the problem. I've re-run
> Stefan's test case and restarted the server more than 400 times now
> without any issue.
>
> I re-read your post where you gave what you yourself called "uninformed
> speculation". There's no real polite way to say it, but yes your
> speculation does appear to be uninformed, since it is incorrect. Reasons
> would be not least that Stefan's tests don't actually send any locks to
> the standby anyway (!), but even if they did your speculation as to the
> cause is still all wrong, as explained.
>
> There is no evidence to link this behaviour with HS, as yet, and you
> should be considering the possibility the problem lies elsewhere,
> especially since it could be code you committed that is at fault.

Well I'm not sure why people seem to have that hard a time reproducing
that issue - it seems that I can provoke it really trivially(in this
case no loops, no pgbench, no tricks). A few minutes ago I logged into
my test standby (which is idle except for the odd connect to template1
caused by nagios - the master is idle as well and has been for days):

postgres(at)soldata005:~$ psql
psql (9.0beta1)
Type "help" for help.

postgres=# select 1;
?column?
----------
1
(1 row)

postgres=# \q
postgres(at)soldata005:~$ pg_ctl -D /var/lib/postgresql/9.0b1/main/ restart
waiting for server to shut down.... done
server stopped
server starting
postgres(at)soldata005:~$ pg_ctl -D /var/lib/postgresql/9.0b1/main/ restart
waiting for server to shut down.... done
server stopped
server starting
postgres(at)soldata005:~$ pg_ctl -D /var/lib/postgresql/9.0b1/main/ restart
waiting for server to shut
down............................................................... failed
pg_ctl: server does not shut down

the server log for that is as follows:

<2010-05-12 20:36:18.166 CEST,,,> LOG: received smart shutdown request
<2010-05-12 20:36:18.167 CEST,,,> FATAL: terminating walreceiver
process due to administrator command
<2010-05-12 20:36:18.174 CEST,,,> LOG: shutting down
<2010-05-12 20:36:18.251 CEST,,,> LOG: database system is shut down
<2010-05-12 20:36:19.706 CEST,,,> LOG: database system was interrupted
while in recovery at log time 2010-05-06 17:36:05 CEST
<2010-05-12 20:36:19.706 CEST,,,> HINT: If this has occurred more than
once some data might be corrupted and you might need to choose an
earlier recovery target.
<2010-05-12 20:36:19.706 CEST,,,> LOG: entering standby mode
<2010-05-12 20:36:19.721 CEST,,,> LOG: consistent recovery state
reached at 1/12000078
<2010-05-12 20:36:19.721 CEST,,,> LOG: invalid record length at 1/12000078
<2010-05-12 20:36:19.723 CEST,,,> LOG: database system is ready to
accept read only connections
<2010-05-12 20:36:19.737 CEST,,,> LOG: streaming replication
successfully connected to primary
<2010-05-12 20:36:19.918 CEST,,,> LOG: received smart shutdown request
<2010-05-12 20:36:19.919 CEST,,,> FATAL: terminating walreceiver
process due to administrator command
<2010-05-12 20:36:19.922 CEST,,,> LOG: shutting down
<2010-05-12 20:36:19.937 CEST,,,> LOG: database system is shut down
<2010-05-12 20:36:21.433 CEST,,,> LOG: database system was interrupted
while in recovery at log time 2010-05-06 17:36:05 CEST
<2010-05-12 20:36:21.433 CEST,,,> HINT: If this has occurred more than
once some data might be corrupted and you might need to choose an
earlier recovery target.
<2010-05-12 20:36:21.433 CEST,,,> LOG: entering standby mode
<2010-05-12 20:36:21.482 CEST,,,> LOG: received smart shutdown request
<2010-05-12 20:36:21.504 CEST,,,> LOG: consistent recovery state
reached at 1/12000078
<2010-05-12 20:36:21.504 CEST,,,> LOG: invalid record length at 1/12000078
<2010-05-12 20:36:21.505 CEST,,,> LOG: database system is ready to
accept read only connections
<2010-05-12 20:36:21.516 CEST,,,> LOG: streaming replication
successfully connected to primary

so it restarted two times successfully - however if one looks at the
third time one can see that it received the smart shutdown request
BEFORE it reached a consistent recovery state - yet it continued to
enable HS and reenabled SR as well.

The database is now sitting there doing nothing and it more or less
broken because you cannot connect to it in the current state:

~$ psql
psql: FATAL: the database system is shutting down

the startup process has the following backtrace:

(gdb) bt
#0 0x00007fbe24cb2c83 in select () from /lib/libc.so.6
#1 0x00000000006e811a in pg_usleep ()
#2 0x000000000048c333 in XLogPageRead ()
#3 0x000000000048c967 in ReadRecord ()
#4 0x0000000000493ab6 in StartupXLOG ()
#5 0x0000000000495a88 in StartupProcessMain ()
#6 0x00000000004ab25e in AuxiliaryProcessMain ()
#7 0x00000000005d4a7d in StartChildProcess ()
#8 0x00000000005d70c2 in PostmasterMain ()
#9 0x000000000057d898 in main ()

Stefan

In response to

Re: max_standby_delay considered harmful at 2010-05-12 15:28:13 from Simon Riggs

Responses

Re: max_standby_delay considered harmful at 2010-05-12 19:25:14 from Simon Riggs
Re: max_standby_delay considered harmful at 2010-05-12 19:36:07 from Alvaro Herrera

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Alvaro Herrera	2010-05-12 19:10:47	Re: pg_upgrade versus MSVC build scripts
Previous Message	Alvaro Herrera	2010-05-12 19:04:18	Re: multibyte charater set in levenshtein function