From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Joshua Yanovski <pythonesque(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Proposal: COUNT(*) (and related) speedup |
Date: | 2014-04-08 14:59:23 |
Message-ID: | CA+TgmoY-wi6oYkd6uxG1Mb0e1=Gan7_mGu4PRMTwG0cgw2ni2g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, Apr 4, 2014 at 1:19 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Joshua Yanovski <pythonesque(at)gmail(dot)com> writes:
>>> But worse, what happens if a count(*)
>>> is in progress? It might or might not have scanned this page already,
>>> and there's no way to get the right answer in both cases. Counter
>>> updates done by VACUUM would have a similar race-condition problem.
>
>> I don't think the first part really matters. If the page was visible
>> when COUNT(*) started and then got dirtied by some other transaction,
>> any changes by the second transaction shouldn't alter the actual count
>> in our transaction anyway, so whether we scan the page needlessly or
>> not seems beside the point.
>
> The question is not whether you scan a page "needlessly" or not, it's
> whether you double-count the tuples on it. I don't think it's possible to
> be sure that when you add the central counter value into your local sum,
> you are neither double-counting nor omitting pages whose all-visible state
> changed while you were scanning the table.
I think it would work if you also had information about which pages
you needed to scan. For example, suppose you had a data structure
sorta like the visibility map, except that instead of one bit per page
you have one 16-bit integer per page. The integer is -1 if the page
isn't known to be all-visible, and is otherwise the count of tuples on
the page. Now, we can do a count(*) as follows:
1. Lock the first as-yet-unexamined page of the data structure.
2. Add all the values that aren't -1 to your running count of tuples,
and remember the page numbers corresponding to the remaining entries.
3. Unlock the page you locked in step 1.
4. Visit each heap page you remembered in step 2 and add the number of
tuples visible to your scan to your running count of tuples.
5. Provided there's a page of the data structure you haven't visited
yet, go to step 1.
On the write side, any operation that clears the visibility map bit
must also set the entry for that page to -1. When the page is
all-visible, the value can be set to the total number of tuples on the
page. This is safe because, even if the page ceases to be all-visible
after we add its tuples to the count, the new tuples that were added
aren't visible to our scan, and any tuples deleted are still visible
to our scan - because the transaction making the changes, in either
case, is obviously still running, and therefore didn't commit before
we took our snapshot.
Now, this wouldn't be O(1) but it would presumably be quite fast if
your visibility map bits are mostly set. One 8kB page of 16-bit
counters would cover 32MB of heap.
The bad news is that I am pretty sure there's no easy way for an index
AM to get callbacks at the right time, so it's really unclear how
something like this would fit in with our existing abstractions. And
even if we answered that question, it would be a lot of work for a
pretty narrow use case.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2014-04-08 15:01:41 | Re: Fwd: Proposal: variant of regclass |
Previous Message | Tom Lane | 2014-04-08 14:50:35 | Re: Fwd: Proposal: variant of regclass |