Re: Unused expression indexes

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Maciek Sakrejda <maciek(at)pganalyze(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Unused expression indexes
Date: 2024-08-06 21:20:25
Message-ID: 015a0c94-7435-4c10-a2f3-0b5d87c81301@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 8/6/24 22:25, Tom Lane wrote:
> Maciek Sakrejda <maciek(at)pganalyze(dot)com> writes:
>> In a blog post [1], Bruce Momjian notes that expression indexes can
>> help with planning even if they're not used directly. But the examples
>> cited in that post are vague (i.e., they improve stats, but it's not
>> clear how they could change plans), and Bruce's answer to a comment
>> [2] suggests that this is not documented.
>
>> Is there any more info on this mechanism? Specifically, if one has
>> unused expression indexes (according to pg_stat_user_indexes), is it
>> safe to drop them? Or could they be providing statistics that
>> materially affect query planning even though the indexes themselves
>> are unused?
>
> Expression indexes definitely can affect planning, because ANALYZE
> collects stats on the values of those expressions. As a trivial
> example,
>
> regression=# create table foo (x1 float8);
> CREATE TABLE
> regression=# insert into foo select 10 * random() from generate_series(1,10000);
> INSERT 0 10000
> regression=# analyze foo;
> ANALYZE
> regression=# explain analyze select * from foo where sqrt(x1) < 1;
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------
> Seq Scan on foo (cost=0.00..195.00 rows=3333 width=8) (actual time=0.009..0.546 rows=1028 loops=1)
> Filter: (sqrt(x1) < '1'::double precision)
> Rows Removed by Filter: 8972
> Planning Time: 0.065 ms
> Execution Time: 0.572 ms
> (5 rows)
>
> The planner has no info about the values of sqrt(x1), so you get a
> default estimate (one-third) of the selectivity of the WHERE clause.
> But watch this:
>
> regression=# create index on foo (sqrt(x1));
> CREATE INDEX
> regression=# analyze foo;
> ANALYZE
> regression=# explain analyze select * from foo where sqrt(x1) < 1;
> QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on foo (cost=24.24..84.63 rows=1026 width=8) (actual time=0.078..0.229 rows=1028 loops=1)
> Recheck Cond: (sqrt(x1) < '1'::double precision)
> Heap Blocks: exact=45
> -> Bitmap Index Scan on foo_sqrt_idx (cost=0.00..23.98 rows=1026 width=0) (actual time=0.068..0.068 rows=1028 loops=1)
> Index Cond: (sqrt(x1) < '1'::double precision)
> Planning Time: 0.113 ms
> Execution Time: 0.259 ms
> (7 rows)
>
> Now there are stats about the values of sqrt(x1), allowing a far more
> accurate selectivity estimate to be made. In this particular example
> there's no change of plan (it would have used the index anyway), but
> certainly a different rowcount estimate can make a big difference.
>
> This mechanism is quite ancient, and in principle it's now superseded
> by extended statistics. For example, I can drop this index and
> instead do
>
> regression=# drop index foo_sqrt_idx;
> DROP INDEX
> regression=# create statistics foostats on sqrt(x1) from foo;
> CREATE STATISTICS
> regression=# analyze foo;
> ANALYZE
> regression=# explain analyze select * from foo where sqrt(x1) < 1;
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------
> Seq Scan on foo (cost=0.00..195.00 rows=1026 width=8) (actual time=0.006..0.479 rows=1028 loops=1)
> Filter: (sqrt(x1) < '1'::double precision)
> Rows Removed by Filter: 8972
> Planning Time: 0.079 ms
> Execution Time: 0.503 ms
> (5 rows)
>
> So the accurate rowcount estimate is still obtained in this example;
> and we're not incurring any index maintenance costs, only ANALYZE
> costs that are going to be roughly the same either way.
>
> However, I am not certain that extended statistics are plugged into
> all the places where the older mechanism applies. Tomas Vondra might
> have a better idea than I of where gaps remain in that.
>

AFAIK it handles all / exactly the same cases. The magic happens in
examine_variable() in selfuncs.c, where we look for stats for simple
Vars, and then for stats for expressions. First we look at all indexes,
and then (if there's no suitable index) at all extended stats.

There might be a place doing something ad hoc, but I can't think of any.

regards

--
Tomas Vondra

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2024-08-06 21:20:37 Re: Inconsistency with EXPLAIN ANALYZE CREATE MATERIALIZED VIEW
Previous Message Tom Lane 2024-08-06 21:03:54 Re: Unused expression indexes