Re: Feature request for adoptive indexes

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Hayk Manukyan <manukyantt(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Feature request for adoptive indexes
Date: 2021-11-02 14:03:06
Message-ID: aef50e7e-55c1-2467-f7b0-3a2f90420a10@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/2/21 13:04, Hayk Manukyan wrote:
> Tomas Vondra
> > Are you suggesting those are not the actual best/worst cases and we
> > should use some other indexes? If yes, which ones?
>
> I would say yes.
> In my case I am not querying only sequence column.
> I have the following cases which I want to optimize.
> 1. Select * from Some_table where job = <somthing> and nlp = <something>
> and year = <something> and *scan_id = <something> *
> 2. Select * from Some_table where job = <somthing> and nlp = <something>
> and year = <something> and *Issue_flag = <something> *
> 3. Select * from Some_table where job = <somthing> and nlp = <something>
> and year = <something> and *sequence = <something> *
> Those are queries that my app send to db that is why I said that from
> *read perspective* our *best case* is 3 separate indexes for
> *(job, nlp, year, sequence)* AND *(job, nlp, year, Scan_ID)* and *(job,
> nlp, year,  issue_flag)*  and any other solution like
>  (job, nlp, year, sequence, Scan_ID, issue_flag) OR  (job, nlp, year )
> INCLUDE(sequence, Scan_ID, issue_flag)  OR just (job, nlp, year) can be
> considered as*worst case *

I already explained why using INCLUDE in this case is the wrong thing to
do, it'll harm performance compared to just defining a regular index.

> I will remind that in real world scenario I have ~50m rows and about
> *~5k rows for each (job, nlp, year )*

Well, maybe this is the problem. We have 50M rows, but the three columns
have too many distinct values - (job, nlp, year) defines ~50M groups, so
there's only a single row per group. That'd explain why the two indexes
perform almost equally.

So I guess you need to modify the data generator so that the data set is
more like the case you're trying to improve.

> From *write perspective* as far as we want to have only one index
> our*best case* can be considered any of
> *(job, nlp, year, sequence, Scan_ID, issue_flag)* OR *(job, nlp, year )
> INCLUDE(sequence, Scan_ID, issue_flag) *
> and the*worst case* will be having 3 separate queries like in read
> perspective
> (job, nlp, year, sequence) AND (job, nlp, year, Scan_ID) and (job, nlp,
> year,  issue_flag)
>

Maybe. It's true a write with three indexes will require modification to
three leaf pages (on average). With a single index we have to modify
just one leaf page, depending on where the row gets routed.

But with the proposed "merged" index, the row will have to be inserted
into three smaller trees. If the trees are large enough, they won't fit
into a single leaf page (e.g. the 5000 index tuples is guaranteed to
need many pages, even if you use some smart encoding). So the write will
likely need to modify at least 3 leaf pages, getting much closer to
three separate indexes. At which point you could just use three indexes.

> So I think the comparison that we did is not right because we are
> comparing different/wrong things.
> > For right results we need to compare this two cases when we are doing
> random queries with 1,2,3  and random writes.
>

I'm not going to spend any more time on tweaking the benchmark, but if
you tweak it to demonstrate the difference / benefits I'll run it again
on my machine etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Borisov 2021-11-02 14:04:14 Re: Feature request for adoptive indexes
Previous Message Nitin Jadhav 2021-11-02 13:35:48 Re: Multi-Column List Partitioning