Re: Strange query stalls on replica in 9.3.9

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Strange query stalls on replica in 9.3.9
Date: 2015-08-13 20:59:42
Message-ID: CAMkU=1zu0dF5M8WUZqcCPWX6f572AsAv7X_WVteYvaWGYm=Qfg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Thu, Aug 13, 2015 at 10:09 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> Setup:
>
> * PostgreSQL 9.3.9
> * 1 master, 1 replica
> * Tiny database, under 0.5GB, completely cached in shared_buffers
> * 90% read query traffic, which is handled by replica
> * Traffic in the 1000's QPS.
>
> The wierdness:
>
> Periodically the master runs an "update all rows" query on the main
> table in the database. When this update hits the replica via
> replication stream, *some* (about 5%) of the queries which do seq scans
> will stall for 22 to 32 seconds (these queries normally take about
> 75ms). Queries which do index scans seem not to be affected.
>
> Thing is, the update all rows only takes 2.5 seconds to execute on the
> master. So even if the update is blocking the seq scans on the replica
> (and I can't see why it would), it should only block them for < 3 seconds.
>
> Anyone seen anything like this?
>

Sounds like another manifestation of this: "[PERFORM] Planner performance
extremely affected by an hanging transaction (20-30 times)?"

http://www.postgresql.org/message-id/CAMkU=1yy-YEQVvqj2xJitT1EFkyuFk7uTV_hrOMGyGMxpU=N+Q@mail.gmail.com

Each backend that does a seqscan, for each tuple it scans which is not yet
resolved (which near the end of the bulk update is going to be nearly equal
to 2*reltuples, as every tuple has both an old and a new version so one
xmax from one and one xmin from the other must be checked), it has to lock
and scan the proc array lock to see if the tuple-inserting transaction has
committed yet. This creates profound contention on the lock. Every
scanning backend is looping over every other backend for every tuple

Once the commit of the whole-table update has replayed, the problem should
go way instantly because at that point each backend doing the seqscan will
find the the transaction has committed and so will set the hint bit that
means all of the other seqscan backends that come after it can skip the
proc array scan for that tuple.

So perhaps the commit of the whole-table update is delayed because the
startup process as also getting bogged down on the same contended lock? I
don't know how hard WAL replay hits the proc array lock.

Cheers,

Jeff

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message robbyc 2015-08-14 03:15:01 Re: Slow Query
Previous Message Kevin Grittner 2015-08-13 20:24:28 Re: Strange query stalls on replica in 9.3.9