From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Sean Chittenden <sean(at)chittenden(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Correlation in cost_index()
Date: 2003-08-08 21:23:04
Message-ID: ruu7jvo5vnjl064rm327ndqml0pmvgdsfa@4ax.com
Lists: pgsql-hackers
On Fri, 8 Aug 2003 11:06:56 -0700, Sean Chittenden
<sean(at)chittenden(dot)org> wrote:
>[...] it'd seem as though an avg depth of
>nodes in index * tuples_fetched * (random_io_cost * indexCorrelation)
>would be closer than where we are now...
Index depth does not belong here, because we walk down the index only
once per index scan, not once per tuple. It might be part of the
startup cost.
The rest of your formula doesn't seem right either, because it gives
higher costs for better correlation. Did you mean
random_io_cost * (1 - abs(indexCorrelation))?
FWIW, for a small effective_cache_size, max_IO_cost is almost equal to
tuples_fetched * random_page_cost. So your formula (with the
corrections presumed above) boils down to ignoring
effective_cache_size and doing a linear interpolation between 0 and
max_IO_cost.
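To illustrate what I mean, here is a rough sketch of the two
interpolations -- made-up helper names and example numbers, not the
actual cost_index() code -- with min_IO_cost being the cost for a
perfectly correlated index and max_IO_cost the cost for a completely
uncorrelated one:

    #include <math.h>
    #include <stdio.h>

    /* Sketch only, not the real cost_index(); numbers are examples. */
    static double
    interp_current(double min_io, double max_io, double correlation)
    {
        /* roughly what the current code does: interpolate between
         * max_IO_cost and min_IO_cost with the squared correlation */
        double csquared = correlation * correlation;

        return max_io + csquared * (min_io - max_io);
    }

    static double
    interp_proposed(double max_io, double correlation)
    {
        /* your formula with the sign fixed:
         * tuples_fetched * random_io_cost * (1 - |correlation|),
         * i.e. a straight line from max_IO_cost (corr = 0) down to 0
         * (|corr| = 1); min_IO_cost and effective_cache_size play no role */
        return (1.0 - fabs(correlation)) * max_io;
    }

    int
    main(void)
    {
        double min_io = 2700.0;     /* example values only */
        double max_io = 488000.0;

        printf("corr 0.5: current %.0f, proposed %.0f\n",
               interp_current(min_io, max_io, 0.5),
               interp_proposed(max_io, 0.5));
        return 0;
    }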
>It's very possible that cost_index() is wrong, but it seems as though
>after some testing as if PostgreSQL _overly_ _favors_ the use of
>indexes:
Was this an unpatched backend? What were the values of
effective_cache_size and random_page_cost?
># SET enable_seqscan = true; SET enable_indexscan = true;
>SET
>SET
># EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE utc_date > '2002-10-01'::TIMESTAMP WITH TIME ZONE;
>INFO: cost_seqscan: run_cost: 21472.687500
> startup_cost: 0.000000
>
>INFO: cost_index: run_cost: 21154.308116
> startup_cost: 0.000000
> indexCorrelation: 0.999729
> QUERY PLAN
>-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Index Scan using report_user_cat_count_utc_date_id_idx on report_user_cat_count rucc (cost=0.00..21154.31 rows=705954 width=64) (actual time=91.36..6625.79 rows=704840 loops=1)
> Index Cond: (utc_date > '2002-10-01 00:00:00-07'::timestamp with time zone)
> Total runtime: 11292.68 msec
>(3 rows)
"actual time=91.36..6625.79" but "Total runtime: 11292.68 msec"!
Where did those 4.7 seconds go?
># SET enable_seqscan = true; SET enable_indexscan = false;
>SET
>SET
># EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE utc_date > '2002-10-01'::TIMESTAMP WITH TIME ZONE;
>INFO: cost_seqscan: run_cost: 21472.687500
> startup_cost: 0.000000
>
>INFO: cost_index: run_cost: 21154.308116
> startup_cost: 100000000.000000
> indexCorrelation: 0.999729
> QUERY PLAN
>---------------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on report_user_cat_count rucc (cost=0.00..21472.69 rows=705954 width=64) (actual time=1091.45..7441.19 rows=704840 loops=1)
> Filter: (utc_date > '2002-10-01 00:00:00-07'::timestamp with time zone)
> Total runtime: 10506.44 msec
>(3 rows)
Same here: "actual time=1091.45..7441.19" but "Total runtime: 10506.44
msec" - more than 3 seconds lost.
When we ignore total runtime and look at actual time we get

                 seq     idx
    estimated  21473   21154
    actual      7441    6626

This doesn't look too bad, IMHO.
BTW, I believe that with your example (single-column index, almost
perfect correlation, index cond selecting almost all tuples) all
interpolation methods give an index cost estimate that is very close
to the seq scan cost, and the actual runtimes show that this is correct.
>Which I find surprising and humorous given the popular belief is, mine
>included, contrary to those results.
How many tuples are in report_user_cat_count? What are the stats for
report_user_cat_count.utc_date?
> I can say with pretty high
>confidence that the patch to use a geometric mean isn't correct after
>having done real world testing as its break even point is vastly
>incorrect and only uses an index when there are less than 9,000 rows
>to fetch, a far cry from the 490K break even I found while testing.
Could you elaborate, please? The intention of my patch was to favour
index scans more than the current implementation does. If it does not,
you have found a bug in my patch. Did you test the other interpolation
methods?
>What I did find interesting, however, was that it does work better at
>determining the use of multi-column indexes,
Yes, because it computes the correlation for a two-column index as

    correlation_of_first_index_column * 0.95

instead of

    correlation_of_first_index_column / 2
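A quick sketch of what that difference amounts to once the correlation
is squared in the interpolation (hypothetical helpers, not the code in
selfuncs.c; only the two-column case from your example):

    #include <stdio.h>

    /* Hypothetical helpers restating the two rules above for a
     * two-column index; not the actual selfuncs.c code. */
    static double
    corr_patched(double first_col_corr)
    {
        return first_col_corr * 0.95;
    }

    static double
    corr_current(double first_col_corr)
    {
        return first_col_corr / 2;
    }

    int
    main(void)
    {
        double c = 0.999729;    /* first-column correlation from your example */
        double p = corr_patched(c);
        double q = corr_current(c);

        /* with a squared-correlation interpolation, the patched estimate
         * moves ~90% of the way towards min_IO_cost, the current one
         * only ~25% */
        printf("patched %.4f (squared %.2f), current %.4f (squared %.2f)\n",
               p, p * p, q, q * q);
        return 0;
    }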
> but I think that's
>because the geometric mean pessimizes the value of indexCorrelation,
>which gets pretty skewed when using a multi-column index.
I don't understand this.
># CREATE INDEX report_user_cat_count_utc_date_user_id_idx ON report_user_cat_count (user_id,utc_date);
># CLUSTER report_user_cat_count_utc_date_user_id_idx ON report_user_cat_count;
># ANALYZE report_user_cat_count;
># SET enable_seqscan = true; SET enable_indexscan = true;
>SET
>SET
># EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 AND utc_date > '2002-01-01'::TIMESTAMP WITH TIME ZONE;
>INFO: cost_seqscan: run_cost: 23685.025000
> startup_cost: 0.000000
>
>INFO: cost_index: run_cost: 366295.018684
> startup_cost: 0.000000
> indexCorrelation: 0.500000
^^^
This is certainly not with my patch. The current implementation gives
ca. 366000 for pages_fetched = 122000 and random_page_cost = 4, which
looks plausible for 133000 tuples and a (too?) small
effective_cache_size.
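Here is the back-of-the-envelope calculation behind that figure;
everything except pages_fetched and random_page_cost is a guess from
the EXPLAIN output, not a value taken from the actual statistics:

    #include <stdio.h>

    int
    main(void)
    {
        /* rough check of the ~366000 figure */
        double pages_fetched    = 122000.0;
        double random_page_cost = 4.0;
        double max_io_cost      = pages_fetched * random_page_cost;  /* 488000 */

        double table_pages      = 14400.0;  /* guessed from the seq scan cost */
        double selectivity      = 0.19;     /* ~134000 of ~706000 tuples */
        double min_io_cost      = selectivity * table_pages;         /* ~2700 */

        double correlation      = 0.5;
        double csquared         = correlation * correlation;         /* 0.25 */

        /* run_cost = max_IO_cost + csquared * (min_IO_cost - max_IO_cost) */
        double run_cost = max_io_cost + csquared * (min_io_cost - max_io_cost);

        printf("estimated index IO cost: %.0f\n", run_cost);  /* ~366700 */
        return 0;
    }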
> QUERY PLAN
>------------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on report_user_cat_count rucc (cost=0.00..23685.03 rows=133918 width=64) (actual time=0.28..6100.85 rows=129941 loops=1)
> Filter: ((user_id < 1000) AND (utc_date > '2002-01-01 00:00:00-08'::timestamp with time zone))
> Total runtime: 6649.21 msec
>(3 rows)
>
># SET enable_seqscan = false; SET enable_indexscan = true;
>SET
>SET
># EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 AND utc_date > '2002-01-01'::TIMESTAMP WITH TIME ZONE;
>INFO: cost_seqscan: run_cost: 23685.025000
> startup_cost: 100000000.000000
>
>INFO: cost_index: run_cost: 366295.018684
> startup_cost: 0.000000
> indexCorrelation: 0.500000
> QUERY PLAN
>-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Index Scan using report_user_cat_count_utc_date_user_id_idx on report_user_cat_count rucc (cost=0.00..366295.02 rows=133918 width=64) (actual time=53.91..3110.42 rows=129941 loops=1)
> Index Cond: ((user_id < 1000) AND (utc_date > '2002-01-01 00:00:00-08'::timestamp with time zone))
> Total runtime: 3667.47 msec
>(3 rows)
Which shows that Postgres does not "_overly_ _favor_ the use of
indexes".
>If I manually set the indexCorrelation to 1.0, however, the planner
>chooses the right plan on its own
Ok, with indexCorrelation == 1.0 we don't have to discuss interpolation
methods, because they all return min_IO_cost.
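To spell out the csquared case:
max_IO_cost + (1.0 * 1.0) * (min_IO_cost - max_IO_cost) = min_IO_cost,
and the other interpolation variants reach the same endpoint at
|correlation| = 1.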
>Which suggests to me that line 3964 in
>./src/backend/utils/adt/selfuncs.c isn't right for multi-column
>indexes, esp for indexes that are clustered.
Agreed.
> I don't know how to
>address this though...
I guess there is no chance of getting this right without index statistics.
>FWIW, this is an old data/schema from a defunct project that I can
>give people access to if they'd like.
Is there a dump available for download?
Servus
Manfred