From: Frank Millman <frank(at)chagford(dot)com>
To: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Slow SELECT
Date: 2020-05-26 10:25:26
Message-ID: d4676b1c-a829-741b-1112-1be4e588beb4@chagford.com
Lists: pgsql-general
On 2020-05-26 11:27 AM, Charles Clavadetscher wrote:
> On 2020-05-26 11:10, Charles Clavadetscher wrote:
>> Hello
>>
>> On 2020-05-26 10:38, Frank Millman wrote:
>>> On 2020-05-26 9:32 AM, Olivier Gautherot wrote:
>>>> Hi Frank,
>>>>
>>>> On Tue, May 26, 2020 at 9:23 AM Frank Millman
>>>> <frank(at)chagford(dot)com> wrote:
>>>>
>>>> Hi all
>>>>
>>>> I have a SELECT that runs over 5 times slower on PostgreSQL compared
>>>> with SQL Server and sqlite3. I am trying to understand why.
>>>>
>>>> I have a table that looks like this (simplified) -
>>>>
>>>> CREATE TABLE my_table (
>>>>     row_id SERIAL PRIMARY KEY,
>>>>     deleted_id INT DEFAULT 0,
>>>>     fld_1 INT REFERENCES table_1(row_id),
>>>>     fld_2 INT REFERENCES table_2(row_id),
>>>>     fld_3 INT REFERENCES table_3(row_id),
>>>>     fld_4 INT REFERENCES table_4(row_id),
>>>>     tran_date DATE,
>>>>     tran_total DEC(21,2)
>>>> );
>>>>
>>>> CREATE UNIQUE INDEX my_table_ndx ON my_table (fld_1, fld_2, fld_3,
>>>> fld_4, tran_date) WHERE deleted_id = 0;
>>>>
>>>> The table sizes are -
>>>> my_table : 167 rows
>>>> table_1 : 21 rows
>>>> table_2 : 11 rows
>>>> table_3 : 3 rows
>>>> table_4 : 16 rows
>>>>
>>>> Therefore for each tran_date in my_table there are potentially
>>>> 21x11x3x16 = 11088 rows. Most will be null.
>>>>
>>>> I want to select the row_id for the last tran_date for each of those
>>>> potential groups. This is my select -
>>>>
>>>> SELECT (
>>>>     SELECT a.row_id FROM my_table a
>>>>     WHERE a.fld_1 = b.row_id
>>>>     AND a.fld_2 = c.row_id
>>>>     AND a.fld_3 = d.row_id
>>>>     AND a.fld_4 = e.row_id
>>>>     AND a.deleted_id = 0
>>>>     ORDER BY a.tran_date DESC LIMIT 1
>>>> )
>>>> FROM table_1 b, table_2 c, table_3 d, table_4 e
>>>>
>>>> Out of 11088 rows selected, 103 are not null.
>>>>
>>>> On identical data, this takes 0.06 sec on SQL Server, 0.04 sec on
>>>> sqlite3, and 0.31 sec on PostgreSQL.
>>>>
>>>>
>>>> SQL Server does a good job at caching data in memory. PostgreSQL
>>>> does too on consecutive calls to the same table. What execution time
>>>> do you get if you issue the query a second time?
>>>>
>>>> My first guess would be to add an index on my_table.tran_date and
>>>> check in EXPLAIN that you don't have a SEQUENTIAL SCAN on that table.
>>>>
>>>> I have looked at the EXPLAIN, but I don't really know what to
>>>> look for.
>>>> I can supply it if that would help.
>>>>
>>>> Thanks for any advice.
>>>>
>>>
>>> Thanks Olivier. Unfortunately that did not help.
>>>
>>> I was already running the query twice and only timing the second one.
>>>
>>> I added the index on tran_date. The timing is the same, and EXPLAIN
>>> shows that it is using a SEQUENTIAL SCAN.
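
(For reference, the index added here was just a plain index on the date
column, i.e. something like the statement below - the name is mine, for
illustration only -

CREATE INDEX my_table_tran_date_ndx ON my_table (tran_date);

but, as noted, it made no difference to the timing.)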
>>>
>>> Here is the EXPLAIN -
>>>
>>>
>>> Nested Loop  (cost=0.00..64155.70 rows=11088 width=4)
>>>   ->  Nested Loop  (cost=0.00..10.36 rows=528 width=12)
>>>         ->  Nested Loop  (cost=0.00..2.56 rows=33 width=8)
>>>               ->  Seq Scan on table_2 c  (cost=0.00..1.11 rows=11 width=4)
>>>               ->  Materialize  (cost=0.00..1.04 rows=3 width=4)
>>>                     ->  Seq Scan on table_3 d  (cost=0.00..1.03 rows=3 width=4)
>>>         ->  Materialize  (cost=0.00..1.24 rows=16 width=4)
>>>               ->  Seq Scan on table_4 e  (cost=0.00..1.16 rows=16 width=4)
>>>   ->  Materialize  (cost=0.00..1.31 rows=21 width=4)
>>>         ->  Seq Scan on table_1 b  (cost=0.00..1.21 rows=21 width=4)
>>>   SubPlan 1
>>>     ->  Limit  (cost=5.77..5.77 rows=1 width=8)
>>>           ->  Sort  (cost=5.77..5.77 rows=1 width=8)
>>>                 Sort Key: a.tran_date DESC
>>>                 ->  Seq Scan on my_table a  (cost=0.00..5.76 rows=1 width=8)
>>>                       Filter: ((fld_1 = b.row_id) AND (fld_2 = c.row_id) AND (fld_3 = d.row_id) AND (fld_4 = e.row_id) AND (deleted_id = 0))
>>>
>>>
>>> Frank
>>
>> If I read it correctly, the query runs sequential scans on all tables,
>> i.e. table_1 to table_4.
>> Do you have an index on the referenced keys (row_id) in table_1 to
>> table_4?
>>
>> It often happens that referenced keys are not indexed, leading to poor
>> execution plans.
>>
>> Bye
>> Charles
>
> I noticed later that you have very small tables. This will probably lead
> to a sequential scan although there is an index in place.
>
> I am not sure if it makes a difference, but what about using explicit
> joins?
>
> SELECT a.row_id FROM my_table a
> JOIN table_1 b ON (b.row_id = a.fld_1)
> JOIN table_2 c ON (c.row_id = a.fld_2)
> JOIN table_3 d ON (d.row_id = a.fld_3)
> JOIN table_4 e ON (e.row_id = a.fld_4)
> WHERE a.deleted_id = 0
> ORDER BY a.tran_date DESC LIMIT 1;
>
Thanks, Charles. I tried that, but unfortunately it produces a different
result. I need to test every possible combination of fld_1 to fld_4 and
get the highest tran_date for each one. The explicit joins only test
combinations that actually exist in my_table, and the final ORDER BY with
LIMIT 1 returns the highest date across all of them combined - a single
row rather than one row per combination.

See my reply to David Rowley. I do not fully understand his solution
yet, but it seems to be what I am looking for.
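
For anyone reading this in the archives, my current understanding of the
general shape of that solution is to keep the cross join of the four
lookup tables but fetch the latest row per combination with a LATERAL
subquery. The sketch below is my own reconstruction, not necessarily his
exact query (the LEFT JOIN LATERAL form and the alias "latest" are mine) -

SELECT latest.row_id
FROM table_1 b
CROSS JOIN table_2 c
CROSS JOIN table_3 d
CROSS JOIN table_4 e
LEFT JOIN LATERAL (
    SELECT a.row_id
    FROM my_table a
    WHERE a.fld_1 = b.row_id
      AND a.fld_2 = c.row_id
      AND a.fld_3 = d.row_id
      AND a.fld_4 = e.row_id
      AND a.deleted_id = 0
    ORDER BY a.tran_date DESC
    LIMIT 1
) latest ON true;

The LEFT JOIN ... ON true keeps the combinations with no matching
transaction as nulls, and for each combination the subquery should be
able to walk my_table_ndx backwards and stop at the first row, instead
of seq-scanning and sorting my_table once per combination as in the
EXPLAIN above.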
Thanks again
Frank