From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Indexes
Date: 2016-11-08 19:53:58
Message-ID: CA+TgmoYki1rP6ohsrgLhWJyrmtpMSc+vCpUFBAUxrq1oy_gz=Q@mail.gmail.com
Lists: pgsql-hackers
On Mon, Nov 7, 2016 at 9:51 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> [ new patches ]
Attached is yet another incremental patch with some suggested changes.
+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (primary page of a bucket) and it is reponsibility of caller to
+ * release that lock.
This is confusing, because it makes it sound like we retain the lock
through the entire execution of the function, which isn't always true.
I would say that caller must acquire a cleanup lock on the target
primary bucket page before calling this function, and that on return
that page will again be write-locked. However, the lock might be
temporarily released in the meantime, while visiting overflow pages.
(Attached patch has a suggested rewrite.)
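In other words, something along these lines:

 * Caller must acquire a cleanup lock on the target primary bucket page
 * before calling this function.  On return, that page will again be
 * write-locked; however, the lock may be temporarily released while we
 * visit overflow pages.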
+ * During scan of overflow pages, first we need to lock the next bucket and
+ * then release the lock on current bucket. This ensures that any concurrent
+ * scan started after we start cleaning the bucket will always be behind the
+ * cleanup. Allowing scans to cross vacuum will allow it to remove tuples
+ * required for sanctity of scan.
This comment says that it's bad if other scans can pass our cleanup
scan, but it doesn't explain why. I think it's because we don't have
page-at-a-time mode yet, and cleanup might decrease the TIDs for
existing index entries. (Attached patch has a suggested rewrite, but
might need further adjustment if my understanding of the reasons is
incomplete.)
+ if (delay)
+ vacuum_delay_point();
You don't really need "delay". If we're not in a cost-accounted
VACUUM, vacuum_delay_point() just turns into CHECK_FOR_INTERRUPTS(),
which should be safe (and a good idea) regardless. (Fixed in
attached.)
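That is, the call site can simply become:

    /*
     * vacuum_delay_point() reduces to CHECK_FOR_INTERRUPTS() when we're
     * not in a cost-accounted VACUUM, so calling it unconditionally is
     * both safe and desirable.
     */
    vacuum_delay_point();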
+ if (callback && callback(htup, callback_state))
+ {
+ /* mark the item for deletion */
+ deletable[ndeletable++] = offno;
+ if (tuples_removed)
+ *tuples_removed += 1;
+ }
+ else if (bucket_has_garbage)
+ {
+ /* delete the tuples that are moved by split. */
+ bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+ maxbucket,
+ highmask,
+ lowmask);
+ /* mark the item for deletion */
+ if (bucket != cur_bucket)
+ {
+ /*
+ * We expect tuples to either belong to curent bucket or
+ * new_bucket. This is ensured because we don't allow
+ * further splits from bucket that contains garbage. See
+ * comments in _hash_expandtable.
+ */
+ Assert(bucket == new_bucket);
+ deletable[ndeletable++] = offno;
+ }
+ else if (num_index_tuples)
+ *num_index_tuples += 1;
+ }
+ else if (num_index_tuples)
+ *num_index_tuples += 1;
+ }
OK, a couple things here. First, it seems like we could also delete
any tuples where ItemIdIsDead, and that seems worth doing. In fact, I
think we should check it prior to invoking the callback, because it's
probably quite substantially cheaper than the callback. Second,
repeating deletable[ndeletable++] = offno and *num_index_tuples += 1
doesn't seem very clean to me; I think we should introduce a new bool
to track whether we're keeping the tuple or killing it, and then use
that to drive which of those things we do. (Fixed in attached.)
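Roughly what I have in mind (an untested sketch; "kill_tuple", and the
use of "itemid" for the current line pointer, are just illustrative):

            bool        kill_tuple = false;

            if (ItemIdIsDead(itemid))
            {
                /* already dead, no need to consult the callback */
                kill_tuple = true;
            }
            else if (callback && callback(htup, callback_state))
            {
                kill_tuple = true;
                if (tuples_removed)
                    *tuples_removed += 1;
            }
            else if (bucket_has_garbage)
            {
                /* tuples moved by a split may also need to be removed */
                bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
                                              maxbucket, highmask, lowmask);
                if (bucket != cur_bucket)
                {
                    /* a bucket containing garbage can't have split again */
                    Assert(bucket == new_bucket);
                    kill_tuple = true;
                }
            }

            if (kill_tuple)
            {
                /* mark the item for deletion */
                deletable[ndeletable++] = offno;
            }
            else if (num_index_tuples)
                *num_index_tuples += 1;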
+ if (H_HAS_GARBAGE(bucket_opaque) &&
+ !H_INCOMPLETE_SPLIT(bucket_opaque))
This is the only place in the entire patch that uses
H_INCOMPLETE_SPLIT(), and I'm wondering if that's really correct even
here. Don't you really want H_OLD_INCOMPLETE_SPLIT() here? (And
couldn't we then remove H_INCOMPLETE_SPLIT() itself?) There's no
garbage to be removed from the "new" bucket until the next split, when
it will take on the role of the "old" bucket.
I think it would be a good idea to change this so that
LH_BUCKET_PAGE_HAS_GARBAGE doesn't get set until
LH_BUCKET_OLD_PAGE_SPLIT is cleared. The current way is confusing,
because those tuples are NOT garbage until the split is completed!
Moreover, both of the places that care about
LH_BUCKET_PAGE_HAS_GARBAGE need to make sure that
LH_BUCKET_OLD_PAGE_SPLIT is clear before they do anything about
LH_BUCKET_PAGE_HAS_GARBAGE, so the change I am proposing would
actually simplify the code very slightly.
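To illustrate: today both callers need to spell it

    if (H_HAS_GARBAGE(bucket_opaque) &&
        !H_OLD_INCOMPLETE_SPLIT(bucket_opaque))

whereas, if the garbage flag could only be set once the split flag has
been cleared, a bare

    if (H_HAS_GARBAGE(bucket_opaque))

would be enough.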
+#define H_OLD_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
The code isn't consistent about the use of these macros, or at least
not in a good way. When you care about LH_BUCKET_OLD_PAGE_SPLIT, you
test it using the macro; when you care about H_NEW_INCOMPLETE_SPLIT,
you ignore the macro and test it directly. Either get rid of both
macros and always test directly, or keep both macros and use both of
them. Using a macro for one but not the other is strange.
I wonder if we should rename these flags and macros. Maybe
LH_BUCKET_OLD_PAGE_SPLIT -> LH_BEING_SPLIT and
LH_BUCKET_NEW_PAGE_SPLIT -> LH_BEING_POPULATED. I think that might be
clearer. When LH_BEING_POPULATED is set, the bucket is being filled -
that is, populated - from the old bucket. And maybe
LH_BUCKET_PAGE_HAS_GARBAGE -> LH_NEEDS_SPLIT_CLEANUP, too.
+ * Copy bucket mapping info now; The comment in _hash_expandtable
+ * where we copy this information and calls _hash_splitbucket explains
+ * why this is OK.
After a semicolon, the next word should not be capitalized. There
shouldn't be two spaces after a semicolon, either. Also,
_hash_splitbucket appears to have a verb before it (calls) and a verb
after it (explains) and I have no idea what that means.
+ for (;;)
+ {
+ mask = lowmask + 1;
+ new_bucket = old_bucket | mask;
+ if (new_bucket > metap->hashm_maxbucket)
+ {
+ lowmask = lowmask >> 1;
+ continue;
+ }
+ blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+ break;
+ }
I can't help feeling that it should be possible to do this without
looping. Can we ever loop more than once? How? Can we just use an
if-then instead of a for-loop?
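For example, if it really can't loop more than twice, and given that
lowmask is always of the form 2^k - 1, I'd expect something like this
(untested) to be equivalent:

    mask = lowmask + 1;
    new_bucket = old_bucket | mask;
    if (new_bucket > metap->hashm_maxbucket)
        new_bucket = old_bucket | (mask >> 1);
    blkno = BUCKET_TO_BLKNO(metap, new_bucket);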
Can't _hash_get_oldbucket_newblock call _hash_get_oldbucket_newbucket
instead of duplicating the logic?
I still don't like the names of these functions very much. If you
said "get X from Y", it would be clear that you put in Y and you get
out X. If you say "X 2 Y", it would be clear that you put in X and
you get out Y. As it is, it's not very clear which is the input and
which is the output.
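For example (names only for illustration), something like
_hash_get_newbucket_from_oldbucket() and
_hash_get_newblock_from_oldbucket() would make it obvious that the old
bucket is the input and the new bucket or block is the output.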
+ bool primary_buc_page)
I think we could just go with "primary_page" here. (Fixed in attached.)
+ /*
+ * Acquiring cleanup lock to clear the split-in-progress flag ensures that
+ * there is no pending scan that has seen the flag after it is cleared.
+ */
+ _hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+ opage = BufferGetPage(bucket_obuf);
+ oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
I don't understand the comment, because the code *isn't* acquiring a
cleanup lock.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment: more-hash-tweaks.patch (application/x-download, 7.8 KB)