Re: Hash Indexes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Indexes
Date: 2016-11-17 17:05:45
Message-ID: CAA4eK1JnYZA9g1RJQT8xmU=VtReKdM_yw8cW-LeYZpZXO2gctQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 17, 2016 at 3:08 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sat, Nov 12, 2016 at 12:49 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> You are right and I have changed the code as per your suggestion.
>
> So...
>
> + /*
> + * We always maintain the pin on bucket page for whole scan operation,
> + * so releasing the additional pin we have acquired here.
> + */
> + if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
> + _hash_dropbuf(rel, *bufp);
>
> This relies on the page contents to know whether we took a pin; that
> seems like a bad plan. We need to know intrinsically whether we took
> a pin.
>

Okay, I think we can do that, as we have the bucket buffer information
(hashso_bucket_buf) in HashScanOpaqueData. We might need to pass this
information to _hash_readprev().
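
Something along these lines is what I have in mind for
_hash_readprev() (a sketch only, assuming the scan's opaque state is
available there, so that we can compare buffers directly instead of
inspecting the special space):

if (*bufp == so->hashso_bucket_buf)
{
    /*
     * We hold a pin on the primary bucket page for the whole scan,
     * so release the additional pin acquired while stepping back
     * onto it.  Comparing buffers against the scan state avoids
     * relying on the page contents.
     */
    _hash_dropbuf(rel, *bufp);
}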

> + * If the bucket split is in progress, then we need to skip tuples that
> + * are moved from old bucket. To ensure that vacuum doesn't clean any
> + * tuples from old or new buckets till this scan is in progress, maintain
> + * a pin on both of the buckets. Here, we have to be cautious about
>
> It wouldn't be a problem if VACUUM removed tuples from the new bucket,
> because they'd have to be dead anyway. It also wouldn't be a problem
> if it removed tuples from the old bucket that were actually dead. The
> real issue isn't vacuum anyway, but the process of cleaning up after a
> split. We need to hold the pin so that tuples being moved from the
> old bucket to the new bucket by the split don't get removed from the
> old bucket until our scan is done.
>

Are you expecting a comment change here?

> + old_blkno = _hash_get_oldblock_from_newbucket(rel,
> opaque->hasho_bucket);
>
> Couldn't you pass "bucket" here instead of "opaque->hasho_bucket"? I
> feel like I'm repeating this ad nauseum, but I really think it's bad
> to rely on the special space instead of our own local variables!
>

Sure, we can pass bucket as well. However, if you look a few lines
below (while (BlockNumberIsValid(opaque->hasho_nextblkno))), we are
already relying on the special space there. In general, we use the
special space to pass information to functions in many other places
in the code. What exactly bothers you about accessing the special
space, if it is safe to do so?
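
That said, the change itself is trivial, assuming the caller still
has the bucket number in the local variable bucket:

/* use the caller's local variable instead of the special space */
old_blkno = _hash_get_oldblock_from_newbucket(rel, bucket);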

> - /* we ran off the end of the bucket without finding a match */
> + /*
> + * We ran off the end of the bucket without finding a match.
> + * Release the pin on bucket buffers. Normally, such pins are
> + * released at end of scan, however scrolling cursors can
> + * reacquire the bucket lock and pin in the same scan multiple
> + * times.
> + */
> *bufP = so->hashso_curbuf = InvalidBuffer;
> ItemPointerSetInvalid(current);
> + _hash_dropscanbuf(rel, so);
>
> I think this comment is saying that we'll release the pin on the
> primary bucket page for now, and then reacquire it later if the user
> reverses the scan direction. But that doesn't sound very safe,
> because the bucket could be split in the meantime and the order in
> which tuples are returned could change. I think we want that to
> remain stable within a single query execution.
>

Isn't that possible even without the patch? Basically, after reaching
the end of a forward scan, performing a backward *all* scan requires a
portal rewind, which in turn calls hashrescan, where we drop the lock
on the bucket. Then, when we try to move the cursor forward again, we
reacquire the lock in _hash_first(). In the window where we don't hold
the lock, the split could happen, and the next scan's results could
differ.

Also, the documentation says: "The SQL standard says that it is
implementation-dependent whether cursors are sensitive to concurrent
updates of the underlying data by default. In PostgreSQL, cursors are
insensitive by default, and can be made sensitive by specifying FOR
UPDATE." I think this indicates that consistent results can't be
guaranteed between forward and backward scans.

So, even if we come up with a solution that gives stable results in
some scenarios, I am not sure it can be guaranteed in all of them.

> + /*
> + * setting hashso_skip_moved_tuples to false
> + * ensures that we don't check for tuples that are
> + * moved by split in old bucket and it also
> + * ensures that we won't retry to scan the old
> + * bucket once the scan for same is finished.
> + */
> + so->hashso_skip_moved_tuples = false;
>
> I think you've got a big problem here. Suppose the user starts the
> scan in the new bucket and runs it forward until they end up in the
> old bucket. Then they turn around and run the scan backward. When
> they reach the beginning of the old bucket, they're going to stop, not
> move back to the new bucket, AFAICS. Oops.
>

After the scan has finished the old bucket and turned around, it will
actually restart the scan (_hash_first) and start from the end of the
new bucket. That is also a problem: it should actually start from the
end of the old bucket, which is exactly the next problem you mention.
So, I think if we fix that next problem, we are okay.

> _hash_first() has a related problem: a backward scan starts at the end
> of the new bucket and moves backward, but it should start at the end
> of the old bucket, and then when it reaches the beginning, flip to the
> new bucket and move backward through that one. Otherwise, a backward
> scan and a forward scan don't return tuples in opposite order, which
> they should.
>
> I think what you need to do to fix both of these problems is a more
> thorough job gluing the two buckets together. I'd suggest that the
> responsibility for switching between the two buckets should probably
> be given to _hash_readprev() and _hash_readnext(), because every place
> that needs to advance to the next or previous page cares about
> this. Right now you are trying to handle it mostly in the functions
> that call those functions, but that is prone to errors of omission.
>

That seems like a better way, so I will change it accordingly.
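
Roughly, I am imagining something like this in _hash_readnext() (a
sketch only; in_new_bucket_of_split() and old_bucket_buf are
hypothetical names for "a split was in progress at the start of the
scan and we haven't switched buckets yet" and for the pinned
old-bucket buffer; _hash_readprev() would do the mirror image):

blkno = (*opaquep)->hasho_nextblkno;

/* drop lock on the current page; bucket pages keep their scan-long pin */
if (*bufp == so->hashso_bucket_buf)
    LockBuffer(*bufp, BUFFER_LOCK_UNLOCK);
else
    _hash_relbuf(rel, *bufp);

*bufp = InvalidBuffer;
if (BlockNumberIsValid(blkno))
{
    /* common case: just step to the next overflow page */
    *bufp = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
}
else if (in_new_bucket_of_split(so))
{
    /*
     * We ran off the end of the new bucket while a split was in
     * progress at the start of the scan, so continue with the old
     * bucket, whose primary page we keep pinned for the whole scan.
     */
    *bufp = so->old_bucket_buf;
    LockBuffer(*bufp, BUFFER_LOCK_SHARE);
}

if (BufferIsValid(*bufp))
{
    *pagep = BufferGetPage(*bufp);
    *opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
}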

> Also, I think that so->hashso_skip_moved_tuples is badly designed.
> There are two separate facts you need to know: (1) whether you are
> scanning a bucket that was still being populated at the start of your
> scan and (2) if yes, whether you are scanning the bucket being
> populated or whether you are instead scanning the corresponding "old"
> bucket. You're trying to keep track of that using one Boolean, but
> one Boolean only has two states and there are three possible states
> here.
>

So, would you prefer two booleans to track those facts, a uint8
holding a tri-state value, or something else?
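
For instance, an enum would make the three states explicit (names here
are illustrative only, not from the patch):

/* illustrative only: one way to encode the three states you describe */
typedef enum
{
    HASHSO_BUCKET_ORDINARY,     /* no split in progress at scan start */
    HASHSO_IN_POPULATED_BUCKET, /* split in progress; in the new bucket */
    HASHSO_IN_OLD_BUCKET        /* split in progress; in the old bucket */
} HashScanBucketState;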

>
> acquire cleanup lock on primary bucket page
> loop:
> scan and remove tuples
> if this is the last bucket page, break out of loop
> pin and x-lock next page
> release prior lock and pin (except keep pin on primary bucket page)
> if the page we have locked is not the primary bucket page:
> release lock and take exclusive lock on primary bucket page
> if there are no other pins on the primary bucket page:
> squeeze the bucket to remove free space
>
> Come to think of it, I'm a little worried about the locking in
> _hash_squeezebucket(). It seems like we drop the lock on each "write"
> bucket page before taking the lock on the next one. So a concurrent
> scan could get ahead of the cleanup process. That would be bad,
> wouldn't it?
>

Yeah, that would be bad if it happened, but no concurrent scan can
happen during the squeeze phase, because we take an exclusive lock on
the bucket page and maintain it throughout the operation.

Thanks for such a detailed review.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
