Re: Odd behavior with indices

From: joe meiring <josephmeiring(at)gmail(dot)com>
To: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Odd behavior with indices
Date: 2016-02-26 20:38:09
Message-ID: CAHd-1_60EiAg6EWsaNsi7asndNcpiBgJaPojSYCjV0eq3LehKg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Here's the distribution of parameter_id's

select count(parameter_id), parameter_id from datavalue group by parameter_id
88169 142889171 815805 178570 124257262 213947049 151225902
24091090 3103877 10633764 11994442 1849232 2014935 4563638
132955919 7

On Fri, Feb 26, 2016 at 2:02 PM, David G. Johnston <
david(dot)g(dot)johnston(at)gmail(dot)com> wrote:

> On Fri, Feb 26, 2016 at 12:43 PM, joe meiring <josephmeiring(at)gmail(dot)com>
> wrote:
>
>> Also available on S.O.:
>>
>>
>> http://stackoverflow.com/questions/35658238/postgres-odd-behavior-with-indices
>>
>> I've got a datavalue table with ~200M rows or so, with indices on both
>> site_id and parameter_id. I need to execute queries like "return all
>> sites with data" and "return all parameters with data". The site table
>> has only 200 rows or so, and the parameter table has only 100 or so rows.
>>
>> The site query is fast and uses the index:
>>
>> EXPLAIN ANALYZEselect *from sitewhere exists (
>> select 1 from datavalue
>> where datavalue.site_id = site.id limit 1);
>>
>> Seq Scan on site (cost=0.00..64.47 rows=64 width=113) (actual time=0.046..1.106 rows=89 loops=1)
>> Filter: (SubPlan 1)
>> Rows Removed by Filter: 39
>> SubPlan 1
>> -> Limit (cost=0.44..0.47 rows=1 width=0) (actual time=0.008..0.008 rows=1 loops=128)
>> -> Index Only Scan using ix_datavalue_site_id on datavalue (cost=0.44..8142.71 rows=248930 width=0) (actual time=0.008..0.008 rows=1 loops=128)
>> Index Cond: (site_id = site.id)
>> Heap Fetches: 0
>> Planning time: 0.361 ms
>> Execution time: 1.149 ms
>>
>> The same query for parameters is rather slow and does NOT use the index:
>>
>> EXPLAIN ANALYZEselect *from parameterwhere exists (
>> select 1 from datavalue
>> where datavalue.parameter_id = parameter.id limit 1);
>>
>> Seq Scan on parameter (cost=0.00..20.50 rows=15 width=2648) (actual time=2895.972..21331.701 rows=15 loops=1)
>> Filter: (SubPlan 1)
>> Rows Removed by Filter: 6
>> SubPlan 1
>> -> Limit (cost=0.00..0.34 rows=1 width=0) (actual time=1015.790..1015.790 rows=1 loops=21)
>> -> Seq Scan on datavalue (cost=0.00..502127.10 rows=1476987 width=0) (actual time=1015.786..1015.786 rows=1 loops=21)
>> Filter: (parameter_id = parameter.id)
>> Rows Removed by Filter: 7739355
>> Planning time: 0.123 ms
>> Execution time: 21331.736 ms
>>
>> What the deuce is going on here? Alternatively, whats a good way to do
>> this?
>>
>> Any help/guidance appreciated!
>>
>>
>>
>> Some of the table description:
>>
>> \d datavalue
>>
>> id BIGINT DEFAULT nextval('datavalue_id_seq'::regclass) NOT NULL,
>> value DOUBLE PRECISION NOT NULL,
>> site_id INTEGER NOT NULL,
>> parameter_id INTEGER NOT NULL,
>> deployment_id INTEGER,
>> instrument_id INTEGER,
>> invalid BOOLEAN,
>> Indexes:
>> "datavalue_pkey" PRIMARY KEY, btree (id)
>> "datavalue_datetime_utc_site_id_parameter_id_instrument_id_key" UNIQUE CONSTRAINT, btree (datetime_utc, site_id, parameter_id, instrument_id)
>> "ix_datavalue_instrument_id" btree (instrument_id)
>> "ix_datavalue_parameter_id" btree (parameter_id)
>> "ix_datavalue_site_id" btree (site_id)
>> "tmp_idx" btree (site_id, datetime_utc)
>> Foreign-key constraints:
>> "datavalue_instrument_id_fkey" FOREIGN KEY (instrument_id) REFERENCES instrument(id) ON UPDATE CASCADE ON DELETE CASCADE
>> "datavalue_parameter_id_fkey" FOREIGN KEY (parameter_id) REFERENCES parameter(id) ON UPDATE CASCADE ON DELETE CASCADE
>> "datavalue_site_id_fkey" FOREIGN KEY (site_id) REFERENCES coastal.site(id) ON UPDATE CASCADE ON DELETE CASCADE
>> "datavalue_statistic_type_id_fkey"
>>
>>
>> ​I'm not great with the details but the short answer - aside from the
> fact that you should consider increasing the statistics on these columns -
> is that at a certain point querying the index and then subsequently
> checking the table for visibility is more expensive than simply scanning
> and then discarding ​the extra rows.
>
> The fact that you could perform an INDEX ONLY scan in the first query
> makes that cost go away since no subsequent heap check is required. In the
> parameters query the planner thinks it needs 1.5 million of the rows and
> will have to check each of them for visibility. It decided that scanning
> the entire table was more efficient.
>
> The LIMIT 1 in both queries should not be necessary. The planner is smart
> enough to stop once it finds what it is looking for. In fact the LIMIT's
> presence may be a contributing factor...but I cannot say for sure.
>
> A better query seems like it would be:
>
> WITH active_sites AS (
> SELECT DISTINCT site_id FROM datavalues;
> )
> SELECT *
> FROM sites
> JOIN active_sites USING (site_id);
>
> David J.
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message David G. Johnston 2016-02-26 20:56:19 Re: Odd behavior with indices
Previous Message David G. Johnston 2016-02-26 20:02:30 Re: Odd behavior with indices