From: | Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz> |
---|---|
To: | Rafał Rzepecki <divided(dot)mind(at)gmail(dot)com> |
Cc: | ashutosh durugkar <ashucould(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org |
Subject: | Re: Planner sometimes doesn't use a relevant index with IN (subquery) condition |
Date: | 2012-11-12 20:49:38 |
Message-ID: | 50A160E2.1050100@archidevsys.co.nz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
On 12/11/12 22:06, Rafał Rzepecki wrote:
> This indeed works around the issue. Thanks!
>
> On Mon, Nov 12, 2012 at 9:53 AM, ashutosh durugkar <ashucould(at)gmail(dot)com> wrote:
>> Hey Rafal,
>>
>>
>>> SELECT * FROM (SELECT run_id, utilization FROM stats) AS s WHERE
>> run_id IN (SELECT run_id FROM runs WHERE server_id = 515);
>>
>> could you try this:
>>
>>
>> SELECT * FROM (SELECT run_id, utilization FROM stats) AS s WHERE
>> run_id = ANY(ARRAY(SELECT run_id FROM runs WHERE server_id = 515));
>>
>> Thanks,
>>
>> On Sun, Nov 11, 2012 at 8:48 AM, Rafał Rzepecki <divided(dot)mind(at)gmail(dot)com>
>> wrote:
>>> [Please CC me on replies, as I'm not subscribed; thank you.]
>>>
>>> I've ran into a problem with the query planner and IN (subquery)
>>> conditions which I suspect to be a bug. I'll attempt to describe the
>>> relevant details of my database and explain which behaviour I find
>>> unexpected. I've also tried to trigger this behaviour in a clean
>>> database; I think I've succeeded, but the conditions are a bit
>>> different, so perhaps it's a different problem. I'll describe this
>>> setup in detail below.
>>>
>>> I have a somewhat large table (~2.5M rows), stats, which is quite
>>> often (several records a minute) INSERTed to, but never UPDATEd or
>>> DELETEd from. (In case it's relevant, it has an attached AFTER INSERT
>>> trigger which checks time and rebuilds an aggregate materialized view
>>> every hour.) This is the schema:
>>> # \d+ stats
>>> Table
>>> "serverwatch.stats"
>>> Column | Type |
>>> Modifiers | Storage | Description
>>>
>>> ------------------+-----------------------------+----------------------------------------------------+---------+-------------
>>> id | integer | not null default
>>> nextval('stats_id_seq'::regclass) | plain |
>>> run_id | integer | not null
>>> | plain |
>>> start_time | timestamp without time zone | not null
>>> | plain |
>>> end_time | timestamp without time zone | not null
>>> | plain |
>>> cpu_utilization | double precision |
>>> | plain |
>>> disk_read_ops | bigint |
>>> | plain |
>>> disk_write_ops | bigint |
>>> | plain |
>>> network_out | bigint |
>>> | plain |
>>> network_in | bigint |
>>> | plain |
>>> disk_read_bytes | bigint |
>>> | plain |
>>> disk_write_bytes | bigint |
>>> | plain |
>>> Indexes:
>>> "stats_pkey" PRIMARY KEY, btree (id)
>>> "stats_day_index" btree (run_id, day(stats.*))
>>> "stats_month_index" btree (run_id, month(stats.*))
>>> "stats_week_index" btree (run_id, week(stats.*))
>>> Foreign-key constraints:
>>> "stats_runs" FOREIGN KEY (run_id) REFERENCES runs(id)
>>> Triggers:
>>> stats_day_refresh_trigger AFTER INSERT OR UPDATE ON stats FOR EACH
>>> STATEMENT EXECUTE PROCEDURE mat_view_refresh('serverwatch.stats_day')
>>> Has OIDs: no
>>>
>>> day(), month() and week() functions are just trivial date_trunc on a
>>> relevant field. The referenced table looks like this:
>>> # \d+ runs
>>> Table "serverwatch.runs"
>>> Column | Type |
>>> Modifiers | Storage | Description
>>>
>>> -----------------+-----------------------------+---------------------------------------------------+---------+-------------
>>> id | integer | not null default
>>> nextval('runs_id_seq'::regclass) | plain |
>>> server_id | integer | not null
>>> | plain |
>>> flavor | flavor | not null
>>> | plain |
>>> region | region | not null
>>> | plain |
>>> launch_time | timestamp without time zone | not null
>>> | plain |
>>> stop_time | timestamp without time zone |
>>> | plain |
>>> project_info_id | integer | not null
>>> | plain |
>>> owner_info_id | integer | not null
>>> | plain |
>>> Indexes:
>>> "runs_pkey" PRIMARY KEY, btree (id)
>>> "index_runs_on_flavor" btree (flavor)
>>> "index_runs_on_owner_info_id" btree (owner_info_id)
>>> "index_runs_on_project_info_id" btree (project_info_id)
>>> "index_runs_on_region" btree (region)
>>> "index_runs_on_server_id" btree (server_id)
>>> Foreign-key constraints:
>>> "runs_owner_info_id_fkey" FOREIGN KEY (owner_info_id) REFERENCES
>>> user_infos(id)
>>> "runs_project_info_id_fkey" FOREIGN KEY (project_info_id)
>>> REFERENCES project_infos(id)
>>> Referenced by:
>>> TABLE "stats_day" CONSTRAINT "stats_day_runs" FOREIGN KEY (run_id)
>>> REFERENCES runs(id)
>>> TABLE "stats" CONSTRAINT "stats_runs" FOREIGN KEY (run_id)
>>> REFERENCES runs(id)
>>> Has OIDs: no
>>>
>>> Now consider this query - note I'm using a subselect here because the
>>> problem originally manifested itself with a view:
>>> SELECT * FROM (SELECT run_id, disk_write_ops FROM stats) AS s WHERE
>>> run_id IN (SELECT id FROM runs WHERE server_id = 515);
>>>
>>> As might be expected, the planner chooses to use one of the three
>>> indices with run_id:
>>> http://explain.depesz.com/s/XU3Q
>>>
>>> Now consider a similar query, but with aggregation:
>>> SELECT * FROM (SELECT run_id, SUM(disk_write_ops) FROM stats GROUP BY
>>> run_id) AS s WHERE run_id IN (SELECT id FROM runs WHERE server_id =
>>> 515);
>>>
>>> Now the picture is very different. The planner, unexplicably,
>>> dismisses the index and opts instead to do a full scan on stats, the
>>> table 2.5 million rows big.
>>> http://explain.depesz.com/s/Rqt
>>>
>>> Note that the problem disappears when we replace the IN condition with
>>> literal:
>>> SELECT * FROM (SELECT run_id, SUM(disk_write_ops) FROM stats GROUP BY
>>> run_id) AS s WHERE run_id IN (1815, 1816);
>>>
>>> The ids are the result of the inner select ran separately, so the
>>> query has the exact same result; it's worth pointing out that the
>>> planner has a correct estimate on the selectivity of the condition -
>>> exactly two rows from runs are selected, as expected. But when literal
>>> is used the planner correctly chooses to use the index:
>>> http://explain.depesz.com/s/lYc
>>>
>>> Similarly a correct plan is chosen when we unnest the inner SELECT:
>>> SELECT run_id, SUM(disk_write_ops) FROM stats WHERE run_id IN (SELECT
>>> id FROM runs WHERE server_id = 515) GROUP BY run_id;
>>> http://explain.depesz.com/s/dlwZ
>>>
>>>
>>> I've tried to replicate this on a clean database:
>>> CREATE TABLE runs(run_id serial PRIMARY KEY, server_id INTEGER NOT NULL);
>>> CREATE INDEX runs_server ON runs(server_id);
>>> CREATE TABLE stats(entry_id serial PRIMARY KEY, run_id integer
>>> REFERENCES runs NOT NULL, utilization INTEGER NOT NULL);
>>> CREATE INDEX stats_runs ON stats(run_id);
>>>
>>> Now let's try some queries:
>>> SELECT * FROM (SELECT run_id, utilization FROM stats) AS s WHERE
>>> run_id IN (1212, 2323, 121, 561, 21, 561, 125, 2, 55, 52, 42);
>>> http://explain.depesz.com/s/Kcb - fine, index used
>>>
>>> SELECT * FROM (SELECT run_id, utilization FROM stats) AS s WHERE
>>> run_id IN (SELECT run_id FROM runs WHERE server_id = 515);
>>> http://explain.depesz.com/s/QFs - seqscan!
>>> Obviously it doesn't mean much, as the tables are empty and there are
>>> no stats, but still a radically different plan is chosen for what is
>>> essentially the same query.
>>>
>>> Note that in this case the behaviour is the same even when unnested:
>>> SELECT run_id, utilization FROM stats WHERE run_id IN (SELECT run_id
>>> FROM runs WHERE server_id = 515);
>>> http://explain.depesz.com/s/y3GM
>>>
>>> So, is this a bug in the planner, or am I somehow subtly changing the
>>> semantics of the query and don't notice?
>>> I understand the planner perhaps tries to parallelize queries when a
>>> SELECT is used in the IN clause, but given the stats it doesn't seem
>>> to make much sense.
>>>
>>> Thanks, and let me know if you want me to test something on my
>>> database over here or if there's some relevant info I've ommited.
>>>
>>> (PostgreSQL 9.1.6 on x86_64-unknown-linux-gnu, compiled by gcc
>>> (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2, 64-bit
>>> running on Ubuntu 12.10, ubuntu package
>>> postgresql-9.1-9.1.6-1ubuntu1:amd64, default configuration)
>>> --
>>> Rafał Rzepecki
>>>
>>>
>>> --
>>> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-performance
>>
>
>
Curious, would the following be of any use?
SELECT DISTINCT
r.run_id,
s.utilization
FROM
runs AS r JOIN stats AS s USING (run_id)
WHERE
r.server_id = 515
/**/;/**/
Cheers,
Gavin
From | Date | Subject | |
---|---|---|---|
Next Message | Dave Cramer | 2012-11-12 20:53:20 | Re: performance regression with 9.2 |
Previous Message | Tom Lane | 2012-11-12 20:43:53 | Re: performance regression with 9.2 |