Re: BUG #16605: PostgreSQL recovery startup process waiting and blocking to application queries

From: Tushar Takate <tushar11(dot)takate(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16605: PostgreSQL recovery startup process waiting and blocking to application queries
Date: 2020-09-11 16:48:27
Message-ID: CAE4W+S+GLYOpDF412JcP8BbJ6RE==fYRGTGyiQs4N6HTsajOag@mail.gmail.com
Lists: pgsql-bugs

Hi Kyotaro,

Thanks for sharing your thoughts on this issue.

Is this behavior expected with the settings below, or is it a bug with
these settings?

hot_standby_feedback = off and max_standby_streaming_delay = -1

We have kept these settings intentionally on this server to avoid bloat
and to prevent queries from being terminated after some set time.
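
For reference, the values currently in effect on the replica can be
confirmed with a query like the following (pg_settings is the standard
catalog view):

=# select name, setting from pg_settings
    where name in ('hot_standby_feedback',
                   'max_standby_streaming_delay',
                   'max_standby_archive_delay');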

If it is a bug, please let me know which version contains the fix, or
in which version the fix is going to arrive.

-
Thanks & Regards,
Tushar K Takate

On Thu, Sep 3, 2020 at 8:52 AM Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
wrote:

> At Wed, 02 Sep 2020 10:58:49 +0000, PG Bug reporting form <
> noreply(at)postgresql(dot)org> wrote in
> > The following bug has been logged on the website:
> >
> > Bug reference: 16605
> > Logged by: Tushar Takate
> > Email address: tushar11(dot)takate(at)gmail(dot)com
> > PostgreSQL version: 10.5
> > Operating system: OEL 6.9
> > Description:
> >
> > Hi Team,
> >
> > Problem/issue statement: the PostgreSQL recovery startup process is
> > blocking application queries.
> >
> > Env Details -
> >
> > PostgreSQL community version - 10.5
> > OS - OEL 6.9
> > Replication type - Streaming - async
> >
> > DB parameters at replica side -
> >
> > max_standby_streaming_delay = -1
> > max_standby_archive_delay = -1
> > hot_standby_feedback = off
> >
> > Startup process as shown by ps -ef:
> >
> > postgres 24489 24487 0 2019 ? 1-20:23:35 postgres:
> > startup process recovering 00000006000021D4000000ED waiting
>
> This is the cause of the whole succeeding blocking chain. "recovering xxx
> waiting" means the process is waiting for a recovery conflict to be
> resolved; in other words, the startup process is being blocked by some
> backend. In your case the startup process should have taken an
> AccessExclusiveLock on relation 17280 before being blocked. With the
> setting max_standby_streaming_delay = -1, the startup process waits
> forever.
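>
> For illustration (a generic query over the standard views, not output
> from this report): on PostgreSQL 10 or later the startup process also
> appears in pg_stat_activity, so the locks it holds while waiting can be
> checked with something like:
>
> =# select a.pid, a.wait_event_type, a.wait_event,
>           l.locktype, l.relation::regclass, l.mode, l.granted
>      from pg_stat_activity a
>      join pg_locks l on l.pid = a.pid
>     where a.backend_type = 'startup';
>
> The granted AccessExclusiveLock rows are what block the application's
> SELECT queries, while the startup process itself keeps waiting on the
> recovery conflict.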
>
> The most common cause of a recovery conflict when "hot_standby_feedback
> = off" is a snapshot conflict. In other words, a vacuum on the primary
> side cannot be replayed on the standby because some transaction on the
> standby side may still refer to the to-be-vacuumed table rows.
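>
> For illustration, the standby transactions holding back replay can
> usually be spotted by their backend_xmin (again a generic
> pg_stat_activity query, not taken from this report):
>
> =# select pid, backend_xmin, state, xact_start
>      from pg_stat_activity
>     where backend_xmin is not null
>     order by age(backend_xmin) desc;
>
> The backends at the top of that list hold the oldest snapshots and are
> the most likely source of a snapshot conflict.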
>
> I'm not sure about an easier way, but the cause can be examined with
> the following steps.
>
> =# select sent_lsn, replay_lsn, file, upper(to_hex(off)) from
> pg_stat_replication, lateral pg_walfile_name_offset(replay_lsn) as o(file,
> off);
> sent_lsn | replay_lsn | file | to_hex
> -----------+------------+--------------------------+--------
> 0/5882B30 | 0/5874878 | 000000010000000000000005 | 874878
>
> You will see that replay_lsn is behind sent_lsn, and the last two
> columns show the location of the blocked record.
>
> $ pg_waldump .../000000010000000000000005 | grep 874878
> rmgr: Heap2 len (rec/tot): 506/ 506, tx: 0, lsn:
> 0/05874878, prev 0/05874850, desc: CLEAN remxid 600, blkref #0: rel
> 1663/12348/16385 blk 0
>
> The Heap2-CLEAN record is emitted by vacuum. You can avoid this kind of
> conflict by turning hot_standby_feedback on, and/or preferably by
> setting max_standby_streaming_delay to an effective value, which
> resolves any kind of conflict by terminating the conflicting backends.
> As a workaround after a snapshot conflict happens, manually terminate
> the backends with backend_xmin <= 600, and the startup process will
> continue recovery.
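>
> A minimal sketch of that last step (600 is the remxid from the
> pg_waldump output above; the text/bigint cast is only a rough filter
> that ignores xid wraparound):
>
> =# select pg_terminate_backend(pid)
>      from pg_stat_activity
>     where backend_xmin is not null
>       and backend_xmin::text::bigint <= 600  -- remxid from pg_waldump
>       and pid <> pg_backend_pid();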
>
> Another common cause is an access exclusive lock that is blocked by a
> standby transaction. In that case you will see Standby-LOCK as the
> problematic WAL record, but pg_locks also shows such lock conflicts.
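>
> For illustration, a simplified self-join on pg_locks (matching on
> relation only, so treat it as a sketch rather than a precise report)
> shows who blocks whom in that case:
>
> =# select w.pid as waiting_pid, h.pid as blocking_pid,
>           w.relation::regclass, h.mode as held_mode
>      from pg_locks w
>      join pg_locks h
>        on h.relation = w.relation and h.granted and h.pid <> w.pid
>     where not w.granted;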
>
> I think other kinds of recovery conflicts rarely happen.
>
> > All of the pids in the wait queue are executing select queries on the
> > same table (mytable).
> >
> > Can you please help us to understand the issue and a possible
> > solution/workaround/fix for it?
> >
> > As a temporary fix we have restarted the DB service (before that, we
> > tried to kill the process with signal -15, which did not work).
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>
