Re: Next Steps with Hash Indexes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Next Steps with Hash Indexes
Date: 2021-07-20 12:00:29
Message-ID: CAA4eK1K9mm_pV6i3MBB-4ZaPi75u2PZvcGoOnF+UYKWD3Pgitw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 15, 2021 at 10:11 PM Simon Riggs
<simon(dot)riggs(at)enterprisedb(dot)com> wrote:
>
> 2. Unique Hash Indexes have been summarized here:
> https://www.postgresql.org/message-id/CAA4eK1KATC1TA5bR5eobYQVO3RWsnH6djNpk3P376em4V8MuUA%40mail.gmail.com
> which also seems to have two parts to it.
>
> 2.1 Uniqueness Check
> Amit: "to ensure that there is no duplicate entry we need to traverse
> the whole bucket chain"
> Agreed. That seems straightforward and can also be improved later.
>
> 2.2 Locking
> Amit's idea of holding ExclusiveLock on the bucket page works for me,
> but there was some doubt about splitting.
>

I think the main question for the uniqueness check during a split
(where we scan both the old and new buckets) is whether we need to
lock both the old (bucket_being_split) and new
(bucket_being_populated) buckets, or whether holding a lock on just
one of them (the bucket we are inserting into) is sufficient. During a
scan of the new bucket, we just retain pins on both buckets (see the
comments in _hash_first()), but if we need to retain locks on both
buckets then we need to do something different than we do for
scans. But I think it is sufficient to hold an exclusive lock on
the primary bucket page of the bucket we are inserting into and a pin
on the other bucket (the old bucket, as we do for scans). No
concurrent inserter should try to insert the same tuple into both the
old bucket and the new bucket, because before starting the split we
always update hashm_lowmask and hashm_highmask in the metapage, which
decide the routing of the tuples.
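
To make the routing argument concrete, here is a minimal sketch in
Python (a toy model, not PostgreSQL source) of how a hash key can be
mapped to a bucket from the metapage state. The function name, the
maxbucket parameter, and the example values are assumptions chosen for
illustration; the point is only that, for a given metapage state, a key
routes to exactly one bucket.

```python
def hashkey_to_bucket(hashkey: int, maxbucket: int,
                      highmask: int, lowmask: int) -> int:
    """Map a hash key to a bucket number (illustrative model only)."""
    bucket = hashkey & highmask
    if bucket > maxbucket:
        # That bucket does not exist yet, so fall back to the
        # lower-half bucket it would later be split from.
        bucket &= lowmask
    return bucket


# Assumed example: buckets 0..3 exist, so the key lands in bucket 0.
before_split = hashkey_to_bucket(0b1100, maxbucket=3,
                                 highmask=0b111, lowmask=0b011)

# Once the metapage says bucket 4 exists (split of bucket 0 started),
# the same key routes to bucket 4 and no longer to bucket 0.
after_split = hashkey_to_bucket(0b1100, maxbucket=4,
                                highmask=0b111, lowmask=0b011)
```

Under these assumptions, every inserter of the same key that reads the
same metapage state picks the same bucket, which is why a lock on that
one bucket's primary page is enough to serialize them.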

The other problem we need to think about here is that, for a hash
index, after finding the tuple in the index we always need to recheck
it in the heap, as we don't store the actual value in the hash
index. For that, a scan gets the tuple(s) from the index, releases
the locks, and then matches the qual after fetching the tuple from the
heap. But we can't do that for the uniqueness check, because if we
release the locks on the index bucket page then another inserter could
come in before we match it in the heap. I think we need some mechanism
so that, after fetching the TID from the index, we recheck the actual
value in the heap before releasing the lock on the index bucket page.
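
The ordering described above can be sketched with a toy model in
Python. Everything here is hypothetical: the class, its fields, and
the use of threading.Lock as a stand-in for the exclusive lock on the
primary bucket page. It also simplifies heavily (the heap row is added
inside the critical section, and visibility rules are ignored). It
only shows the heap recheck happening before the bucket lock is
released.

```python
import threading


class ToyUniqueHashIndex:
    """Toy model: the index stores only (hash code, TID), never the value."""

    def __init__(self, nbuckets: int = 4):
        self.heap = []                                # TID -> actual key value
        self.heap_lock = threading.Lock()             # protects TID assignment
        self.buckets = [[] for _ in range(nbuckets)]  # lists of (hash, TID)
        self.locks = [threading.Lock() for _ in range(nbuckets)]

    def insert_unique(self, key, hashfn=hash) -> bool:
        h = hashfn(key)
        b = h % len(self.buckets)
        # Stand-in for the exclusive lock on the primary bucket page.
        with self.locks[b]:
            for stored_hash, tid in self.buckets[b]:
                # An equal hash code is only a candidate match, so recheck
                # the real value in the heap while still holding the lock.
                if stored_hash == h and self.heap[tid] == key:
                    return False                      # duplicate found
            with self.heap_lock:
                tid = len(self.heap)
                self.heap.append(key)
            self.buckets[b].append((h, tid))
            return True
```

With a deliberately colliding hash function such as `lambda k: 7`,
inserting "a" and then "b" both succeed (same hash code, different heap
values), while a second "a" is rejected. If the lock were dropped
between the index lookup and the heap comparison, two concurrent
inserters of the same key could both pass the check.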

Another thing is that if we have unique support for hash indexes,
then we can probably allow INSERT ... ON CONFLICT if the user
specifies the unique index column as conflict_target.

I am not sure if multicolumn index support is mandatory to allow
uniqueness for hash indexes. Sure, it would be good, but I feel that
can be done as a separate patch as well.

--
With Regards,
Amit Kapila.
