An unkillable connection caused replication delay on my replica

From: Shawn <shamccoy(at)amazon(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: An unkillable connection caused replication delay on my replica
Date: 2016-06-28 19:56:43
Message-ID: 1467143803715-5909767.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Postgres 9.4.5
Streaming replica
hot_standby_feedback=1

I received an alarm that replication was lagging many hours. When I took a
look at what was happening on the replica I found the recovery process was
"stuck" on one WAL.

8719 8717 1 Apr30 ? 15:24:49 postgres: startup process
recovering 0000000100001E9700000096 waiting

An strace of the recovery pid showed a loop trying to kill the blocking pid:

select(0, NULL, NULL, NULL, {0, 5000}) = 0 (Timeout)
gettimeofday({1465523436, 442899}, NULL) = 0
kill(6819, SIGUSR1) = 0
select(0, NULL, NULL, NULL, {0, 5000}) = 0 (Timeout)
gettimeofday({1465523436, 448134}, NULL) = 0
kill(6819, SIGUSR1)

strace of long-running query pid 6819 loops like this:

sendto(10, "<query stuff removed>" NULL, 0) = ? ERESTARTSYS (To be
restarted if SA_RESTART is set)
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=8719,
si_uid=3001} ---
rt_sigreturn()

Where pid 6819 is a long (running for 2 days) running query. In
pg_stat_activity, it was still listed as "active". The query had a horrible
execution plan and it was being executed via a python script. I couldn't
pg_terminate_backend the connection. I didn't try to "kill -9" it due to
all the warnings about that and I felt I had something special here. I
attached the debugger.

Program received signal SIGINT, Interrupt.
0x00002b89d9c62903 in __select_nocancel () from /lib64/libc.so.6
#0 0x00002b89d9c62903 in __select_nocancel () from /lib64/libc.so.6
#1 0x000000000076e6da in pg_usleep (microsec=microsec(at)entry=5000) at
pgsleep.c:53
#2 0x0000000000655f8e in ResolveRecoveryConflictWithVirtualXIDs
(waitlist=0x2b89dc238000,
reason=reason(at)entry=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT) at standby.c:247
#3 0x000000000065609e in ResolveRecoveryConflictWithVirtualXIDs
(reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, waitlist=<optimized out>) at
standby.c:283
#4 ResolveRecoveryConflictWithSnapshot (latestRemovedXid=<optimized out>,
node=...) at standby.c:281
#5 0x000000000048ff7d in heap_xlog_clean (record=0x2b89dc1295c0,
lsn=33636416003112) at heapam.c:7329
#6 heap2_redo (lsn=33636416003112, record=0x2b89dc1295c0) at heapam.c:8487
#7 0x00000000004c5fad in StartupXLOG () at xlog.c:6948
#8 0x000000000061f2bf in StartupProcessMain () at startup.c:224
#9 0x00000000004d0188 in AuxiliaryProcessMain (argc=argc(at)entry=2,
argv=argv(at)entry=0x7fff98c4cba0) at bootstrap.c:422
#10 0x000000000061c8de in StartChildProcess (type=StartupProcess) at
postmaster.c:5256
#11 0x000000000061edd2 in PostmasterMain (argc=argc(at)entry=7,
argv=argv(at)entry=0x2b89dc024040) at postmaster.c:1292
#12 0x00000000004622e5 in main (argc=7, argv=0x2b89dc024040) at main.c:228
Continuing.

Which I believe puts me in this section of code:


https://github.com/postgres/postgres/blob/REL9_4_STABLE/src/backend/storage/ipc/standby.c#L232

The customer was able to restart recovery but only after they "kill -9" the
python script on their application host.

My question: Is there something special about this connection that prevents
recovery from being able to kill it like it wanted to? I am trying to repro
it, but, so far my connections get terminated properly. Can/should Postgres
be more forceful when its determined WaitExceedsMaxStandbyDelay is true and
a session shouldn't be blocking recovery?

Shawn

--
View this message in context: http://postgresql.nabble.com/An-unkillable-connection-caused-replication-delay-on-my-replica-tp5909767.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2016-06-28 20:05:11 Re: An unkillable connection caused replication delay on my replica
Previous Message Bruce Momjian 2016-06-28 18:22:34 Re: Questionabl description in datatype.sgml