Re: Streaming replication - unable to stop the standby

From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication - unable to stop the standby
Date: 2010-05-03 18:45:59
Message-ID: 4BDF19E7.2040205@kaltenbrunner.cc
Lists: pgsql-hackers

Robert Haas wrote:
> On Mon, May 3, 2010 at 2:22 PM, Stefan Kaltenbrunner
> <stefan(at)kaltenbrunner(dot)cc> wrote:
>> Tom Lane wrote:
>>> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>>>> I'm currently testing SR/HS in 9.0beta1 and I noticed that it seems quite
>>>> easy to end up in a situation where you have a standby that seems to be
>>>> stuck in:
>>>> $ psql -p 5433
>>>> psql: FATAL: the database system is shutting down
>>>> but never actually shutting down. I ran into that a few times now
>>>> (mostly because I'm trying to chase a recovery issue I hit during earlier
>>>> testing) by simply having the master iterate between a pgbench run and
>>>> "idle" while simple doing pg_ctl restart in a loop on the standby.
>>>> I do vaguely recall some discussions of that but I thought the issue got
>>>> settled somehow?
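
(For anyone who wants to reproduce this, the test is really just two loops
along these lines - the pgbench options, timings and database name are not
important, this is only a sketch:)

    # on the master (port 5432): alternate between a pgbench run and idle
    while true; do
        pgbench -p 5432 -c 4 -T 60 pgbench    # any pgbench-initialized db
        sleep 60
    done

    # on the standby: restart it in a loop
    while true; do
        pg_ctl -D /mnt/space/pgdata_standby restart
        sleep 30
    done
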
>>> Hm, I haven't pushed this hard but "pg_ctl stop" seems to stop the
>>> standby for me. Which subprocesses of the slave postmaster are still
>>> around? Could you attach to them with gdb and get stack traces?
>> it is not always failing to shut down - it only fails sometimes - I have not
>> exactly pinpointed what is causing this yet, but the standby is in a weird
>> state now:
>>
>> * the master is currently idle
>> * the standby has no connections at all
>>
>> logs from the standby:
>>
>> FATAL: the database system is shutting down
>> FATAL: the database system is shutting down
>> FATAL: replication terminated by primary server
>> LOG: restored log file "000000010000001900000054" from archive
>> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
>> file or directory
>> LOG: record with zero length at 19/55000078
>> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
>> file or directory
>> FATAL: could not connect to the primary server: could not connect to
>> server: Connection refused
>> Is the server running on host "localhost" and accepting
>> TCP/IP connections on port 5432?
>> could not connect to server: Connection refused
>> Is the server running on host "localhost" and accepting
>> TCP/IP connections on port 5432?
>>
>> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
>> file or directory
>> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
>> file or directory
>> LOG: streaming replication successfully connected to primary
>> FATAL: the database system is shutting down
>>
>>
>> the first two "FATAL: the database system is shutting down" are from me
>> trying to connect using psql after I noticed that pg_ctl failed to shut down
>> the slave.
>> The next thing I tried was restarting the master - which led to the
>> logs above; the standby noticed that and reconnected, but you still cannot
>> actually connect...
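
(Side note on the setup: the standby's recovery.conf is nothing special -
roughly the following, reconstructed from the logs above rather than copied
verbatim, so take the exact values with a grain of salt:)

    standby_mode = 'on'
    primary_conninfo = 'host=localhost port=5432'
    restore_command = 'cp /mnt/space/wal-archive/%f %p'
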
>>
>> process tree for the standby is:
>>
>> 29523 pts/2 S 0:00 /home/postgres9/pginst/bin/postgres -D
>> /mnt/space/pgdata_standby
>> 29524 ? Ss 0:06 \_ postgres: startup process waiting for
>> 000000010000001900000055
>> 29529 ? Ss 0:00 \_ postgres: writer process
>> 29835 ? Ss 0:00 \_ postgres: wal receiver process streaming
>> 19/55000078
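
(Getting the backtraces Tom asked for should just be a matter of attaching
to the PIDs above with plain gdb, something like:)

    gdb -p 29524 -batch -ex bt    # startup process
    gdb -p 29835 -batch -ex bt    # wal receiver
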
>
> <uninformed-speculation>
>
> Hmm. When I committed that patch to fix smart shutdown on the
> standby, we discussed the fact that the startup process can't simply
> release its locks and die at shutdown time because the locks it holds
> prevent other backends from seeing the database in an inconsistent
> state. Therefore, if we were to terminate recovery as soon as the
> smart shutdown request is received, the shutdown might never complete,
> because a backend might be waiting on a lock that will never get released. If
> that's really a danger scenario, then it follows that we might also
> fail to shut down if we can't connect to the primary, because we might
> not be able to replay enough WAL to release the locks the remaining
> backends are waiting for. That sort of looks like what is happening
> to you, except based on your test scenario I can't figure out where
> this came from:
>
> FATAL: replication terminated by primary server

As I said before, I restarted the master at that point; the standby
logged the above, restored 000000010000001900000054 from the archive,
tried reconnecting, and logged the "connection refused" errors. A few
seconds later the master was up again and the standby reconnected
successfully.
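
In case it helps the next time this happens: comparing how far recovery
actually got with what the startup process is still waiting for is just a
matter of something like the following (paths as above):

    ps axww | grep 'postgres: startup'
    pg_controldata /mnt/space/pgdata_standby | grep -i location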

Stefan
