Re: Recovery inconsistencies, standby much larger than primary

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Greg Stark <stark(at)mit(dot)edu>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Recovery inconsistencies, standby much larger than primary
Date:	2014-01-31 19:21:49
Message-ID:	19641.1391196109@sss.pgh.pa.us
Lists:	pgsql-hackers

Greg Stark <stark(at)mit(dot)edu> writes:
> So just to summarize, this xlog record:
> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
> 3634978/282
> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
> blk:3634978 hole_off/len:1240/2072

> Appears to have been written to [ block 7141472 ]

I've been staring at the code for a bit trying to guess how that could
have happened. Since the WAL record has a backup block, btree_xlog_insert
would have passed control to RestoreBackupBlock, which would call
XLogReadBufferExtended with mode RBM_ZERO, so there would be no complaint
about writing past the end of the relation. Now, you can imagine some
very low-level error causing a write to go to the wrong page due to a seek
problem or some such, but it's hard to credit that that would've resulted
in creation of all the intervening segment files. Some level of our code
had to have thought it was being told to extend the relation.
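
To put numbers on "all the intervening segment files", here is a rough
sketch of the segment arithmetic. It assumes the stock build settings
(BLCKSZ of 8192 bytes and RELSEG_SIZE of 131072 blocks, i.e. 1GB
segments) and, purely for illustration, that the relation previously
ended in the segment holding the intended block; neither is confirmed
anywhere in this thread. The names below are made up for the example.

```python
# Assumed defaults; a custom build could differ.
BLCKSZ = 8192          # bytes per page
RELSEG_SIZE = 131072   # pages per segment file (1GB / 8kB)

RELNODE = 1261982      # relfilenode from the quoted WAL record
INTENDED_BLK = 3634978 # block named in the WAL record
ACTUAL_BLK = 7141472   # block the page reportedly landed in


def segment_of(blkno):
    """Return (segment number, page offset within that segment)."""
    return blkno // RELSEG_SIZE, blkno % RELSEG_SIZE


intended_seg, _ = segment_of(INTENDED_BLK)
actual_seg, _ = segment_of(ACTUAL_BLK)

# Segment files that would have to appear beyond the intended one.
# Segment 0 has no suffix; later ones are named "<relfilenode>.<n>".
new_segment_files = [
    f"{RELNODE}.{n}" for n in range(intended_seg + 1, actual_seg + 1)
]
new_segments = actual_seg - intended_seg

# Upper bound on added space if every intervening page were written.
extra_bytes = (ACTUAL_BLK - INTENDED_BLK) * BLCKSZ

if __name__ == "__main__":
    print("intended block is in segment", intended_seg)
    print("actual block is in segment", actual_seg)
    print("segment files created:", new_segments)
    print("bytes if fully allocated:", extra_bytes)
```

Under those assumptions the stray write is 27 segment files past the
intended one, which is why a simple misdirected seek seems implausible.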

However, on closer inspection I was a bit surprised to realize that there
are two possible candidates for doing that! XLogReadBufferExtended will
extend the relation, a block at a time, if told to write a page past
the current nominal EOF. And in md.c, _mdfd_getseg will *also* extend
the relation if we're InRecovery, even though it normally would not do
so when called from mdwrite().

Given the behavior in XLogReadBufferExtended, I rather think that the
InRecovery special case in _mdfd_getseg is dead code and should be
removed. But for the purpose at hand, it's more interesting to try to
confirm which of these code levels did the extension. I notice that
_mdfd_getseg only bothers to write the last physical page of each segment,
whereas XLogReadBufferExtended knows nothing of segments and will
ploddingly write every page. So on a filesystem that supports "holes"
in files, I'd expect that the added segments would be fully allocated
if XLogReadBufferExtended did the deed, but they'd be quite small if
_mdfd_getseg did so. The du results you started with suggest that the
former is the case, but could you verify that the filesystem this is
on supports holes and that du will report only the actually allocated
space when there's a hole?
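
One way to run that check, sketched in Python rather than with du: create
a file with a single page written at the far end (roughly what a segment
extended by _mdfd_getseg would look like) and compare its apparent size
with the space actually allocated. This assumes a Unix-like system where
st_blocks is counted in 512-byte units; the function names are
hypothetical, not part of any PostgreSQL tooling.

```python
import os
import tempfile


def apparent_and_allocated(path):
    """Return (apparent size, allocated bytes) for an existing file."""
    st = os.stat(path)
    # st_blocks is in 512-byte units on POSIX systems (assumption).
    return st.st_size, st.st_blocks * 512


def filesystem_supports_holes(directory, size=64 * 1024 * 1024):
    """Write one 8kB page at the end of a new file and report whether
    the filesystem allocated less than the apparent size."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.lseek(fd, size - 8192, os.SEEK_SET)
        os.write(fd, b"\0" * 8192)
        os.fsync(fd)
        st = os.fstat(fd)
        return st.st_blocks * 512 < st.st_size
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    import sys
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print("holes supported:", filesystem_supports_holes(target))
    for name in sys.argv[2:]:
        print(name, apparent_and_allocated(name))
```

Run against the directory holding the relation, then against the added
segment files: allocated space close to the apparent size would point at
XLogReadBufferExtended, a few kB per segment at _mdfd_getseg.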

Assuming that the extension was done in XLogReadBufferExtended, we are
forced to the conclusion that XLogReadBufferExtended was passed a bad
block number (viz 7141472); and it's pretty hard to see how that could
happen. RestoreBackupBlock is just passing the value it got out of the
WAL record. I thought about the idea that it was wrong about exactly
where the BkpBlock struct was in the record, but that would presumably
lead to garbage relnode and fork numbers, not just a bad block number.
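
A small illustration of why a misplaced struct would not corrupt the
block number alone. It assumes a BkpBlock layout of three 32-bit
RelFileNode fields, a 32-bit fork number, a 32-bit block number and two
16-bit hole fields, on a little-endian machine; that layout is recalled
from the 9.3-era sources and should be checked against the actual
headers. The 4-byte shift is an arbitrary example.

```python
import struct

# Assumed layout: spcNode, dbNode, relNode, fork, block,
# hole_offset, hole_length (little-endian, no padding).
BKPBLOCK_FMT = "<IIIiIHH"

# Values from the quoted record: s/d/r 1663/16385/1261982,
# main fork (0), blk 3634978, hole_off/len 1240/2072.
header = struct.pack(BKPBLOCK_FMT, 1663, 16385, 1261982, 0,
                     3634978, 1240, 2072)

# Pad so that a misaligned read still has enough bytes to parse.
buf = header + b"\0" * 4

good = struct.unpack_from(BKPBLOCK_FMT, buf, 0)
shifted = struct.unpack_from(BKPBLOCK_FMT, buf, 4)  # 4 bytes too far

if __name__ == "__main__":
    names = ("spcNode", "dbNode", "relNode", "fork", "block",
             "hole_offset", "hole_length")
    for name, g, s in zip(names, good, shifted):
        print(f"{name:12} correct={g:<10} shifted={s}")
```

With the read shifted, every field slides over: the tablespace becomes
16385, relNode becomes 0, and the fork becomes 3634978, so the relation
lookup itself would be nonsense rather than just the block number.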

So I'm still baffled ...

regards, tom lane

In response to

Re: Recovery inconsistencies, standby much larger than primary at 2014-01-31 16:09:24 from Greg Stark

Responses

Re: Recovery inconsistencies, standby much larger than primary at 2014-01-31 20:28:31 from Greg Stark
Re: Recovery inconsistencies, standby much larger than primary at 2014-02-01 09:36:25 from Greg Stark
