Re: Next Steps with Hash Indexes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Next Steps with Hash Indexes
Date: 2021-08-12 04:19:18
Message-ID: CAA4eK1+xft7++BoyiFVbT70pjzYxPvvZfKdk+mfej6tV9sRR-Q@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 12, 2021 at 9:09 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Wed, Aug 11, 2021 at 8:47 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> > As far as the specific point at hand is concerned, I think storing
> > a hash value per index column, while using only the first column's
> > hash for bucket selection, is what to do for multicol indexes.
> > We still couldn't set amoptionalkey=true for hash indexes, because
> > without a hash for the first column we don't know which bucket to
> > look in. But storing hashes for the additional columns would allow
> > us to check additional conditions in the index, and usually save
> > trips to the heap on queries that provide additional column
> > conditions.

Yeah, this sounds reasonable, but I think the alternative proposal from
Dilip (see below) and me [1] also has merit.
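As a rough illustration of Tom's scheme above (a minimal Python sketch
under assumed simplifications, not actual PostgreSQL code; names like
`nbuckets` and the use of Python's `hash()` are stand-ins for the real
hash opclass machinery):

```python
# Sketch of "store a hash per index column, use only the first column's
# hash for bucket selection": the extra per-column hashes let us reject
# non-matching tuples inside the index, saving trips to the heap.

def column_hashes(key_values):
    # One hash per key column; hash() stands in for the real hash function.
    return tuple(hash(v) & 0xFFFFFFFF for v in key_values)

class MultiColHashIndex:
    def __init__(self, nbuckets=8):
        self.nbuckets = nbuckets
        self.buckets = [[] for _ in range(nbuckets)]

    def insert(self, key_values, heap_tid):
        hashes = column_hashes(key_values)
        bucket = hashes[0] % self.nbuckets   # first column picks the bucket
        self.buckets[bucket].append((hashes, heap_tid))

    def scan(self, key_values):
        # The first column's hash is required to find the bucket
        # (hence amoptionalkey=false); the remaining hashes filter
        # candidates before any heap access.
        hashes = column_hashes(key_values)
        bucket = hashes[0] % self.nbuckets
        return [tid for (h, tid) in self.buckets[bucket] if h == hashes]
```

For example, after `insert(("a", 1), "t1")`, a `scan(("a", 2))` lands in
the same bucket (same first-column hash) but is rejected by the second
column's hash without touching the heap.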

> > You could also imagine sorting the contents of a bucket
> > on all the hashes, which would ease uniqueness checks.
>

Yeah, we can do that, but the current design also seems to have merits
for uniqueness checks. To sort all the hashes in a bucket, we would
need to read all the overflow pages and then sort, which could incur
additional I/O in some cases. The other possibility is to keep all the
bucket pages sorted during insertion, but that would also require
additional I/O. OTOH, in the current design, we move to the next page
only if the value is not found in the current bucket page (which keeps
its hash values in sorted order).
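The current per-page behavior described above could be sketched like
this (an illustrative Python model, not PostgreSQL internals; the
page-as-sorted-list representation is an assumption for clarity):

```python
# Each bucket is a chain of pages (bucket page + overflow pages). Hash
# values are sorted *within* each page, but not across pages, so a probe
# binary-searches one page and reads the next page only on a miss --
# avoiding the extra I/O of sorting the whole chain.

import bisect

def search_bucket(pages, target_hash):
    """pages: list of sorted lists of hash values (first entry is the
    bucket page, the rest are overflow pages).
    Returns (found, pages_read)."""
    pages_read = 0
    for page in pages:
        pages_read += 1
        i = bisect.bisect_left(page, target_hash)
        if i < len(page) and page[i] == target_hash:
            return True, pages_read
    return False, pages_read
```

A hit on the bucket page costs a single page read; only a miss forces
the scan onward into the overflow chain.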

> Earlier, I was thinking that we could have two hashes: one for the
> first key column, used to identify the bucket, and one for the
> remaining key columns, which would further help with heap lookup and
> with ordering for uniqueness checks.
>

I mentioned an almost identical idea yesterday [1]. If we go with a
specification similar to MySQL's and SQLServer's, then it would
probably be better than storing the hashes for all the columns.
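The two-hash alternative could be sketched as follows (again an
illustrative Python model, not a proposed implementation; `hash()` is a
stand-in for the real per-type hash functions):

```python
# Alternative layout: one hash of the first key column selects the
# bucket, and a single combined hash of all remaining columns serves
# for tuple filtering and for ordering in uniqueness checks. This
# stores two hash values per index tuple regardless of the number of
# key columns, instead of one hash per column.

def two_hashes(key_values):
    bucket_hash = hash(key_values[0]) & 0xFFFFFFFF
    rest_hash = hash(tuple(key_values[1:])) & 0xFFFFFFFF
    return bucket_hash, rest_hash

class TwoHashIndex:
    def __init__(self, nbuckets=8):
        self.nbuckets = nbuckets
        self.buckets = [[] for _ in range(nbuckets)]

    def insert(self, key_values, heap_tid):
        bh, rh = two_hashes(key_values)
        self.buckets[bh % self.nbuckets].append((bh, rh, heap_tid))

    def scan(self, key_values):
        bh, rh = two_hashes(key_values)
        return [tid for (b, r, tid) in self.buckets[bh % self.nbuckets]
                if b == bh and r == rh]
```

The trade-off versus the per-column scheme is fixed index-tuple
overhead against the flexibility of checking each trailing column's
condition individually.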

> But yeah, if we have a hash value for each column,
> then it will make it really flexible.
>
> I was also looking into how other databases support hash indexes,
> and I see that at least in MySQL [1] a multi-column hash index has
> the limitation that you must supply all the key columns in the
> search condition for the index scan to be selected.

I see that SQLServer has the same specification for multi-column hash
indexes [2]; see the "Multi-column index" section. So it might not be
a bad idea to have a similar specification.

[1] - https://www.postgresql.org/message-id/CAA4eK1JD1%3DnPDi0kDPGLC%2BJDGEYP8DgTanobvgve%2B%2BKniQ68TA%40mail.gmail.com
[2] - https://docs.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/indexes-for-memory-optimized-tables?view=sql-server-ver15

--
With Regards,
Amit Kapila.
