Re: A design for amcheck heapam verification

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A design for amcheck heapam verification
Date: 2017-08-30 01:00:59
Message-ID: CAH2-Wznr0ykpyexDDcDgUbohCJzq1UJ-Asa4aGH1eZt89GGRjw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Aug 29, 2017 at 4:34 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> Some drive-by comments on the lib patch:

I was hoping that you'd look at this, since you'll probably want to
use a bloom filter for parallel hash join at some point. I've tried to
keep this one as simple as possible. I think that there is a good
chance that it will be usable for parallel hash join with multiple
batches. You'll need to update the interface a little bit to make that
work (e.g., bring-your-own-hash interface), but hopefully the changes
you'll need will be limited to a few small details.

> +bloom_add_element(bloom_filter *filter, unsigned char *elem, Size len)
>
> I think the plan is to use size_t for new stuff[1].

I'd forgotten.

> This is another my_log2(), right?
>
> It'd be nice to replace both with fls() or flsl(), though it's
> annoying to have to think about long vs int64 etc. We already use
> fls() in two places and supply an implementation in src/port/fls.c for
> platforms that lack it (Windows?), but not the long version.

win64 longs are only 32-bits, so my_log2() would do the wrong thing
for me on that platform. pow2_truncate() is provided with a number of
bits as its argument, not a number of bytes (otherwise this would
work).

Ideally, we'd never use long integers, because its width is platform
dependent, and yet it is only ever used as an alternative to int
because it is wider than int. One example of where this causes
trouble: logtape.c uses long ints, so external sorts can use half the
temp space on win64.

> +/*
> + * Hash function is taken from sdbm, a public-domain reimplementation of the
> + * ndbm database library.
> + */
> +static uint32
> +sdbmhash(unsigned char *elem, Size len)
> +{

> I see that this is used in gawk, BerkeleyDB and all over the place[2].
> Nice. I understand that this point of this is to be a hash function
> that is different from our usual one, for use by k_hashes().

Right. It's only job is to be a credible hash function that isn't
derivative of hash_any().

> Do you
> think it belongs somewhere more common than this? It seems a bit like
> our hash-related code is scattered all over the place but should be
> consolidated, but I suppose that's a separate project.

Unsure. In its defense, there is also a private murmurhash one-liner
within tidbitmap.c. I don't mind changing this, but it's slightly odd
to expose a hash function whose only job is to be completely unrelated
to hash_any().

> Unnecessary braces here and elsewhere for single line body of for loops.
>
> +bloom_prop_bits_set(bloom_filter *filter)
> +{
> + int bitset_bytes = NBITS(filter) / BITS_PER_BYTE;
> + int64 bits_set = 0;
> + int i;
> +
> + for (i = 0; i < bitset_bytes; i++)
> + {
> + unsigned char byte = filter->bitset[i];
> +
> + while (byte)
> + {
> + bits_set++;
> + byte &= (byte - 1);
> + }
> + }

I don't follow what you mean here.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2017-08-30 01:06:51 Re: segment size depending *_wal_size defaults (was increasing the default WAL segment size)
Previous Message Michael Paquier 2017-08-30 00:49:14 Re: segment size depending *_wal_size defaults (was increasing the default WAL segment size)