From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Inaccuracy in VACUUM's tuple count estimates
Date: 2014-06-06 19:44:25
Message-ID: 10787.1402083865@sss.pgh.pa.us
Lists: pgsql-hackers
I've been looking at the complaint Tim Wilson posted in pgsql-performance
about badly inaccurate reltuples updates coming from VACUUM. There seem
to be a number of problems leading to that. The key point is that when
VACUUM has scanned only part of the relation, it assumes that the
live-tuple density in that part of the relation is num_tuples (I'm
speaking of the counter accumulated in lazy_scan_heap) divided by
scanned_pages, and then it tries to extrapolate that information to the
rest of the relation. Now, the validity of that extrapolation is a bit
questionable given that VACUUM is considering a highly nonrandom subset of
the table's pages, but the real problem is the values are wrong even for
the pages we did look at. To wit:
* scanned_pages is not reliably the number of pages we scanned, because
somebody thought it would be cute to bump it even for pages we decided
didn't need to be scanned because they contain no freezable tuples.
So we have an increment in scanned_pages, but no corresponding increment
in the tuple count, leading to a density underestimate. This seems to
only happen in vacuum-for-wraparound cases, but it's still wrong. We need
to separate the logic about whether we skipped any pages from the
statistical counters.
* num_tuples has very little to do with the number of live tuples, because
it actually counts all nonremovable tuples, including RECENTLY_DEAD,
INSERT_IN_PROGRESS, and DELETE_IN_PROGRESS tuples. In the case Tim is
complaining about, the VACUUM happens concurrently with a long transaction
that is bulk-updating most tuples in the relation, some of them several
times, so that VACUUM sees multiple images of every tuple (one
INSERT_IN_PROGRESS, the rest DELETE_IN_PROGRESS), and thus arrives at a
horrid overestimate of the number of live tuples.
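
To make the arithmetic concrete, here is a simplified, self-contained stand-in for the extrapolation described above. The numbers are invented and the variable names only mirror lazy_scan_heap's counters; this is not the actual VACUUM code:

#include <stdio.h>

int
main(void)
{
    double      rel_pages = 10000.0;    /* total pages in the relation */
    double      scanned_pages = 2000.0; /* pages VACUUM thinks it scanned */
    double      num_tuples = 400000.0;  /* "nonremovable" tuples seen there */

    /*
     * Density in the scanned fraction, extrapolated to the whole table.
     * Bumping scanned_pages for pages that were skipped pushes the density
     * down; counting DELETE_IN_PROGRESS images from a concurrent bulk
     * update pushes it up.
     */
    double      density = num_tuples / scanned_pages;
    double      est_reltuples = density * rel_pages;

    printf("estimated reltuples = %.0f\n", est_reltuples);
    return 0;
}

Both errors feed straight into reltuples, because the same density is applied to all the pages VACUUM never looked at.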
I figured it'd be easy enough to get a better estimate by adding another
counter to count just LIVE and INSERT_IN_PROGRESS tuples (thus effectively
assuming that in-progress inserts and deletes will both commit). I did
that, and found that it helped Tim's test case not at all :-(. A bit of
sleuthing revealed that HeapTupleSatisfiesVacuum actually returns
INSERT_IN_PROGRESS for any tuple whose xmin isn't committed, regardless of
whether the transaction has since marked it for deletion:
    /*
     * It'd be possible to discern between INSERT/DELETE in progress
     * here by looking at xmax - but that doesn't seem beneficial for
     * the majority of callers and even detrimental for some. We'd
     * rather have callers look at/wait for xmin than xmax. It's
     * always correct to return INSERT_IN_PROGRESS because that's
     * what's happening from the view of other backends.
     */
    return HEAPTUPLE_INSERT_IN_PROGRESS;
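
For illustration only, here is roughly the distinction the excerpt declines to make, written as a self-contained sketch. The SKETCH_* types and classify_in_progress() are invented stand-ins, not the real tqual.c code, and the real logic would also have to handle lock-only xmax, aborted subtransactions, and multixacts:

#include <stdbool.h>

typedef unsigned int SketchTransactionId;

typedef enum
{
    SKETCH_INSERT_IN_PROGRESS,
    SKETCH_DELETE_IN_PROGRESS
} SketchResult;

typedef struct
{
    SketchTransactionId xmin;
    SketchTransactionId xmax;
    bool        xmax_is_lock_only;  /* stand-in for HEAP_XMAX_LOCK_ONLY */
} SketchTuple;

/*
 * Caller has already established that xmin is still in progress, so only
 * the inserting transaction (or one of its subtransactions) can have set
 * xmax.
 */
SketchResult
classify_in_progress(const SketchTuple *tup)
{
    if (tup->xmax != 0 && !tup->xmax_is_lock_only)
        return SKETCH_DELETE_IN_PROGRESS;

    return SKETCH_INSERT_IN_PROGRESS;
}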
It did not use to blow this question off: back around 8.3 you got
DELETE_IN_PROGRESS if the tuple had a delete pending. I think we need
less laziness + fuzzy thinking here. Maybe we should have a separate
HEAPTUPLE_INSERT_AND_DELETE_IN_PROGRESS result code? Is it *really*
the case that callers other than VACUUM itself are okay with failing
to make this distinction? I'm dubious: there are very few if any
callers that treat the INSERT and DELETE cases exactly alike.
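
As a purely hypothetical sketch of how VACUUM's counting could use such a result code: HEAPTUPLE_INSERT_AND_DELETE_IN_PROGRESS is only the proposal above, not an existing value, and count_tuple() plus the local enum are invented stand-ins for the real lazy_scan_heap logic:

typedef enum
{
    HEAPTUPLE_DEAD,
    HEAPTUPLE_LIVE,
    HEAPTUPLE_RECENTLY_DEAD,
    HEAPTUPLE_INSERT_IN_PROGRESS,
    HEAPTUPLE_DELETE_IN_PROGRESS,
    HEAPTUPLE_INSERT_AND_DELETE_IN_PROGRESS /* proposed above, hypothetical */
} SketchHTSV;

void
count_tuple(SketchHTSV status, double *num_tuples, double *live_tuples,
            double *tups_vacuumed)
{
    switch (status)
    {
        case HEAPTUPLE_LIVE:
        case HEAPTUPLE_INSERT_IN_PROGRESS:
            /* nonremovable; assume a pending insert will commit */
            *num_tuples += 1;
            *live_tuples += 1;
            break;

        case HEAPTUPLE_RECENTLY_DEAD:
        case HEAPTUPLE_DELETE_IN_PROGRESS:
        case HEAPTUPLE_INSERT_AND_DELETE_IN_PROGRESS:
            /* nonremovable, but assume a pending delete will commit */
            *num_tuples += 1;
            break;

        case HEAPTUPLE_DEAD:
            /* removable; counted separately, not part of either estimate */
            *tups_vacuumed += 1;
            break;
    }
}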
regards, tom lane