Re: The standard 'why does it take so long' question

From: "Nigel J(dot) Andrews" <nandrews(at)investsystems(dot)co(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Joe Conway <mail(at)joeconway(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: The standard 'why does it take so long' question
Date: 2002-08-09 11:12:43
Message-ID: Pine.LNX.4.21.0208091136240.3235-100000@ponder.fairway2k.co.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

First I'll just say thanks to Joe Conway for the suggestion of an index on the
poster_id column. Although I didn't show it I do have an index on that field as
well, however, as I understand it the index on (poster_id,time) should fulfil
this same function.

On Fri, 9 Aug 2002, Tom Lane wrote:

> "Nigel J. Andrews" <nandrews(at)investsystems(dot)co(dot)uk> writes:
> > The first question is why would the index scan on chat_user take
> > significantly longer than before?
>
> I think the indexscan is having to skip over more irrelevant data in the
> second case. Think about it: you can scan an index range consisting of
> a single user's posts between times T1 and T2, or you can scan an index
> range consisting of all posts between times T1 and T2. The second
> column of the index will save you from actually going to the heap for
> posts from other users, but you'll still have to pass over those index
> entries, because the contiguous range of index entries that covers the
> data you want will include a a lot of posts from other users.

That was the idea that popped into my head while I was writing the questions so
that certainly seems reasonable to me.

>
> What interests me is why the planner chose the second index when it
> had a choice; I'd have thought its cost models were good enough to
> handle this subtlety. Can you post the pg_stats rows for the columns
> in question?

I really don't know sufficient to be able to see why the stats would favour one
index over the other. Although looking at the pg_stats entries below now I
notice that the correlation for the time column is 1, compared to 0.058 for the
poster_id. As there is a natural order to the data, i.e. it's loaded in time
order, I am not entirely surprised that there's a high correlation, just
slightly surprised the stats gatherer picked up on it.

Note that the poster_ids for the example query are also shown below and are not
listed in the most_common_vals array. I've also shown the human understandable
values for the time limits in the query and due to the nature/distribution of
this column it's unsurprising to find that I managed to pick a time period that
didn't include one of the more common times.

So, what's making the planner choose the slower index is the correlation figure
right?

archive=> select * from pg_stats where attname = 'time' or attname = 'poster_id' and tablename = 'chat_post';
tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
-----------+-----------+-----------+-----------+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
chat_post | poster_id | 0 | 2 | 338 | {2149,731,1130,2595,1879,1656,688,1842,2473,1747} | {0.0603333,0.0306667,0.0306667,0.03,0.0286667,0.0276667,0.0273333,0.025,0.02,0.0196667} | {4,252,521,819,1112,1423,1639,1820,2096,2427,2879} | 0.0583958
chat_post | time | 0 | 8 | -0.350654 | {"1998-03-26 11:09:00-05","1999-05-13 00:49:00-04","2000-12-14 13:05:00-05","2001-01-25 12:18:00-05","2002-04-04 11:37:00-05","2002-05-28 15:21:00-04","2002-07-10 09:53:00-04"} | {0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667} | {"1998-03-06 01:26:00-05","1998-08-21 11:00:00-04","1999-02-22 15:15:00-05","1999-07-22 05:37:00-04","2000-01-24 09:53:00-05","2000-10-23 17:45:00-04","2001-02-26 16:22:00-05","2001-06-28 20:19:00-04","2001-12-06 14:06:00-05","2002-04-26 16:05:00-04","2002-08-09 00:10:00-04"} | 1
(2 rows)

archive=> select * from chat_user where name in ('thrifty', 'hope1');
id | name | lower_name
------+---------+------------
2494 | thrifty | thrifty
1112 | hope1 | hope1
(2 rows)

archive=> select 'epoch'::timestamptz + '959904000 seconds'::interval as from, 'epoch'::timestamptz + '1023667200 seconds'::interval as to;
from | to
------------------------+------------------------
2000-06-02 01:00:00+01 | 2002-06-10 01:00:00+01
(1 row)

Thanks for the input Tom.

--
Nigel J. Andrews
Director

---
Logictree Systems Limited
Computer Consultants

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Cornelia Boenigk 2002-08-09 11:25:14 Re: Warning: PostgreSQL query failed: ERROR: [my_tabel]: Permission denied in [my_php_program]
Previous Message Barker 2002-08-09 10:46:58 Warning: PostgreSQL query failed: ERROR: [my_tabel]: Permission denied in [my_php_program]