Re: Hot Standby (v9d)

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 18:55:06
Message-ID: 87mydbjm5x.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


I skimmed through the Hot Standby patch for a preliminary review. I noted the
following things, some minor tweaks, some just questions. None of the things I
noted are big issues unless some of the questions uncover issues.

1) This code is obviously a cut-pasto:

+ else if (strcmp(tok1, "max_standby_delay") == 0)
+ {
+ errno = 0;
+ maxStandbyDelay = (TransactionId) strtoul(tok2, NULL, 0);
+ if (errno == EINVAL || errno == ERANGE)
+ ereport(FATAL,
+ (errmsg("max_standby_delay is not a valid number: \"%s\"",
+ tok2)));

Have you ever tested the behaviour with max_standby_delay = -1 ?

Also, the default max_standby_delay is currently 0 -- ie, kill queries
immediately upon detecting any conflicts at all -- which I don't really think
anyone would be happy with. I still *strongly* feel the default has to be the
non-destructive conservative -1.

2) This hard coded constant of 500ms seems arbitrary to me. If your use case
is a heavily-used reporting engine you might get this message all the time. I
think this either has to be configurable (warn_standby_delay?) or based on
max_standby_delay or some other parameter somehow.

+ /*
+ * log that we have been waiting for a while now...
+ */
+ if (!logged && standbyWait_ms > 500)

3) These two blocks of code seem unsatisfactory:

! /*
! * Keep track of the block number of the lastBlockVacuumed, so
! * we can scan those blocks as well during WAL replay. This then
! * provides concurrency protection and allows btrees to be used
! * while in recovery.
! */
! if (lastBlockVacuumed > vstate->lastBlockVacuumed)
! vstate->lastBlockVacuumed = lastBlockVacuumed;
!

+ /*
+ * XXXHS we don't actually need to read the block, we
+ * just need to confirm it is unpinned. If we had a special call
+ * into the buffer manager we could optimise this so that
+ * if the block is not in shared_buffers we confirm it as unpinned.
+ */
+ buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL);
+ if (BufferIsValid(buffer))
+ {
+ LockBufferForCleanup(buffer);
+ UnlockReleaseBuffer(buffer);
+ }

Aside from the XXX comment (I thought we actually had such a call now, but if
not shouldn't we just add one instead of carping?) I'm not convinced this
handles all the cases that can arise. Notable, what happens if a previous
vacuum died in the middle of the scan?

I think we have a vacuum id which we use already with btrees to the same
issue. It seems to me this be more robust if you stamped the xlog record with
that id number and ensured it matched the id of the lastBloockVacuumed? Then
you could start from 0 if the id changes.

4) Why is this necessary?

+ if (IsRecoveryProcessingMode() &&
+ locktag->locktag_type == LOCKTAG_OBJECT &&
+ lockmode > AccessShareLock)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot acquire lockmode %s on database objects while recovery is in progress",
+ lockMethodTable->lockModeNames[lockmode]),
+ errhint("Only AccessShareLock can be acquired on database objects during recovery.")));
+

Obviously we can't lock records (SELECT FOR UPDATE) since that requires write
access. But shouldn't we be able to do manual LOCK TABLE calls?

I hope this isn't the only interlock we have against trying to do write i/o or
DDL against tables...?

5) The code for killing off conflicting transactions looks odd to me but I
haven't really traced through it to see what it's doing. It seems to kill off
just one process? What if there are several holding the lock in question?
Also, it doesn't seem to take into account how long the various transactions
have held the lock?

Is there a risk of, for example, if you have a long report running against a
table and lots of OLTP requests against the same table it seems the conflict
resolution code would fire randomly into the crowd and hit mostly innocent
OLTP transactions until eventually it found the culprit report?

Also, it kills of backends using SIGINT which I *think* Tom went through and
made safe earlier this release, right?

In any case if the backend doesn't die off promptly we wait forever with no
further warnings or log messages. I would think we should at least print some
sort of message here, even if it's a "can't happen" elog. The doubling thing
is probably unnecessary too in this case.

+ if (!XLogRecPtrIsValid(conflict_LSN))
+ {
+ /* wait awhile for it to die */
+ pg_usleep(wontDieWait * 5000L);
+ wontDieWait *= 2;
+ }
+ }

6) I still don't understand why you need unobserved_xids. We don't need this
in normal running, an xid we don't know for certain is committed is exactly
the same as a transaction we know is currently running or aborted. So why do
you need it during HS?

The comment says:

+ * This is very important because if we leave
+ * those xids out of the snapshot then they will appear to be already complete.
+ * Later, when they have actually completed this could lead to confusion as to
+ * whether those xids are visible or not, blowing a huge hole in MVCC.
+ * We need 'em.

But that doesn't sound rational to me. I'm not sure what "confusion" this
would cause. If they later actually complete then any existing snapshots would
still not see them. And any later snapshots wouldn't be confused by the
earlier conclusions.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Gregory Stark 2009-01-28 19:02:37 Re: How to get SE-PostgreSQL acceptable
Previous Message D'Arcy J.M. Cain 2009-01-28 18:51:06 Re: Output filter for psql