From: | Kenneth Marshall <ktm(at)rice(dot)edu> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Xiao Meng <mx(dot)cogito(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: [PATCH]-hash index improving |
Date: | 2008-07-18 17:23:14 |
Message-ID: | 20080718172312.GR337@it.is.rice.edu |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
I just ran my original 16M word test case against the patched
version, and like Tom noted below, the tuples per bucket
calculation is wrong which results in identical index sizes
for both the original version and the hash-value-only version.
> I suppose that the main point of #1 is to reduce index size by
On Fri, Jul 18, 2008 at 12:23:25PM -0400, Tom Lane wrote:
> "Jonah H. Harris" <jonah(dot)harris(at)gmail(dot)com> writes:
> > Agreed. My thinking is that there's either something inherently wrong
> > with the implementation, or we're performing so many disk I/Os that
> > it's nearly equivalent to b-tree. Tom has a couple suggestions which
> > Xiao and I will explore.
>
> I finally got a chance to look through the patch in some detail.
> If I haven't missed something, there are just two improvement ideas
> embodied in it:
>
> 1. Store just the hash value, and not the index key value, in hash
> index entries. (This means that all hash indexscans become lossy
> and have to be rechecked at the heap.)
>
> 2. Keep the contents of each index page ordered by hash value, and use
> binary instead of linear search to find the matching item(s) during an
> indexscan. (This depends on #1 because recomputing the hash values
> during the binary search would be too expensive --- although you could
> also make it work by storing *both* the hash value and the original
> key.)
>
> allowing more tuples to fit in each bucket. However, the patch
> neglects to adjust the target-fillfactor calculation in _hash_metapinit,
> which means that the code won't actually put any more tuples per bucket
> (on average) than it did before. So the index isn't actually any
> smaller and you get no I/O savings --- you just have more unused
> space on a typical page. Fixing that might help.
>
> FWIW, I had always assumed that part of the solution to hash's woes
> would involve decoupling the bucket size from the page size, so
> that you could have multiple buckets per page. But maybe the
> binary-search idea makes that unnecessary. I'm not sure. A whole
> lot depends on how evenly the buckets get filled. You should probably
> investigate how many tuples actually end up in each bucket with and
> without the patch.
>
I think that while the binary-search idea will improve the lookup
over the original sequential scan of the bucket, it makes updates
much more expensive particularly with buckets approaching 100%
full. The idea that I have been mulling over tries to improve access
times by breaking a bucket in mini-virtual buckets within a page. We
restrict the size of the mini-bucket to be pagesize/(1/2^n). The
sweet spot should be around n=6 or 7 which for an 8k pagesize yields
a mini-bucket size of 32 or 64 bytes. Then the search for the value
in a page is to read the virtual bucket corresponding to the n bits
of the hash value.
The second piece is to take advantage of the fact that the size of
the mini-bucket is not an even multiple of the size of a hash index
tuple and aggregate all the lost space for use as the "first" overflow
page for all of a pages mini-buckets. This avoids the I/O needed to
read a full overflow page from disk and accomodates the imperfections
in the hash function distribution. The overflow pages, both the virtual
"first" and subsequent real pages would benefit from the binary lookups.
It may also be worth storing the high and low hash values specially to
avoid the search in a page if its value would not be on the page.
> In the realm of micro-optimizations that might be significant, I think
> you really need to get rid of all those _create_hash_desc calls,
> particularly the one in _hash_checkqual which is part of the inner loop
> of an indexscan. Not only are they slow but they probably represent a
> serious memory leak in a scan that returns many tuples. For reading the
> hash value out of an existing index tuple, I don't think you should be
> bothering with a tupdesc at all --- don't use index_getattr, just map a
> C struct onto the known layout of a indextuple with a single never-null
> int field. This would particularly make for a noticeable improvement in
> the speed of _hash_binsearch. The tupdescs used in storing an index
> entry are probably needed, but you could just use a single statically
> declared tupdesc (look at the ones in relcache.c for examples of
> building one as a constant).
>
+1
> regards, tom lane
>
I think that this sort of virtual bucket would allow us to take
better advantage of the O(1) behavior. What do you all think?
Regards,
Ken
From | Date | Subject | |
---|---|---|---|
Next Message | Erik | 2008-07-18 17:30:24 | Re: [PATCHES] WITH RECUSIVE patches 0717 |
Previous Message | Bruce Momjian | 2008-07-18 17:20:16 | Re: .psqlrc output for \pset commands |