Re: How to properly query lots of rows based on timestamps?

From: Alban Hertroys <haramrae(at)gmail(dot)com>
To: Thorsten Schöning <tschoening(at)am-soft(dot)de>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: How to properly query lots of rows based on timestamps?
Date: 2020-08-29 09:29:07
Message-ID: 32CEEF29-ED8C-4300-B663-7B3D31A6B2CF@gmail.com
Lists: pgsql-general


> On 29 Aug 2020, at 10:24, Thorsten Schöning <tschoening(at)am-soft(dot)de> wrote:
>
> Hi all,
>
> I have a table containing around 95 million rows, pretty much only
> storing a timestamp and IDs referencing related tables that contain
> the actual data.
>
>> CREATE TABLE clt_rec
>> (
>> id BIGSERIAL NOT NULL,
>> oms_rec BIGINT NOT NULL,
>> captured_at TIMESTAMP(6) WITH TIME ZONE NOT NULL,
>> rssi SMALLINT NOT NULL,
>> CONSTRAINT pk_clt_rec PRIMARY KEY (id),
>> CONSTRAINT fk_clt_rec_oms_rec FOREIGN KEY (oms_rec) REFERENCES "oms_rec" ("id"),
>> CONSTRAINT uk_clt_rec_oms_rec UNIQUE (oms_rec)
>> );
>
> In many use cases I need to search all of those rows based on their
> timestamp to find rows arbitrarily far in the past: Sometimes it's
> only 15 minutes into the past, sometimes it's 2 years, sometimes it's
> finding the first day of each month over 15 months for some of those
> telegrams etc. So I pretty often need to compare those timestamps, and
> some queries simply take multiple seconds, which adds up when
> multiple, slightly different queries need to be executed one after
> another. The following are two excerpts of Postgres' query plans:
>
> Plan 1:
>
>> -> Nested Loop (cost=1.14..343169.49 rows=43543 width=20) (actual time=0.313..113.974 rows=34266 loops=3)
>> -> Parallel Index Scan using idx_clt_rec_captured_at on clt_rec (cost=0.57..3437.90 rows=43543 width=24) (actual time=0.153..20.192 rows=34266 loops=3)
>> Index Cond: ((captured_at >= ('2020-08-01 00:00:00+02'::timestamp with time zone - '00:00:00'::interval)) AND (captured_at <= ('2020-08-01 00:00:00+02'::timestamp with time zone + '1 day'::interval)))
>> -> Index Scan using pk_oms_rec on oms_rec (cost=0.57..7.80 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=102799)
>> Index Cond: (id = clt_rec.oms_rec)

What happens here is that the planner looks up the lower and upper boundaries in the index; everything in between those index nodes is a candidate record. Next, it loops over those candidates to match the other condition of your query (id = clt_rec.oms_rec). You didn't say whether there's an index on that column.
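
For reference, I'm assuming the query behind Plan 1 looks roughly like this (the selected columns are a guess on my part; only the WHERE clause follows from the Index Cond in the plan):

SELECT r.id, r.captured_at, r.rssi, o.id
  FROM clt_rec r
  JOIN oms_rec o ON o.id = r.oms_rec
 WHERE r.captured_at >= timestamptz '2020-08-01 00:00:00+02' - interval '00:00:00'
   AND r.captured_at <= timestamptz '2020-08-01 00:00:00+02' + interval '1 day';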

You’d probably see a performance improvement were you to create an index on (captured_at, id). If your Postgres version is somewhat recent, that could even lead to an Index Only Scan.
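
As a sketch, assuming Postgres 11 or later and assuming the query also reads oms_rec and rssi from clt_rec (which the join condition suggests), a covering index would let the clt_rec side be answered from the index alone:

CREATE INDEX idx_clt_rec_captured_at_covering
    ON clt_rec USING btree (captured_at)
    INCLUDE (id, oms_rec, rssi);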

> Plan 2:
>
>> -> Nested Loop (cost=1.14..836381.50 rows=111934 width=20) (actual time=0.379..911.697 rows=334465 loops=3)
>> -> Parallel Index Scan using idx_clt_rec_captured_at on clt_rec (cost=0.57..8819.57 rows=111934 width=24) (actual time=0.193..154.176 rows=334465 loops=3)
>> Index Cond: ((captured_at >= ('2020-08-28 10:21:06.968+02'::timestamp with time zone - '14 days'::interval)) AND (captured_at <= ('2020-08-28 10:21:06.968+02'::timestamp with time zone + '00:00:00'::interval)))
>> -> Index Scan using pk_oms_rec on oms_rec (cost=0.57..7.39 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=1003394)
>> Index Cond: (id = clt_rec.oms_rec)

This is very much the same situation, just with a larger number of candidate records.

> Postgres seems to properly use available indexes, parallel workers and
> stuff like that. But looking at the actual times and compared to all
> the other parts of the query, comparing those timestamps simply takes
> the most time.

It only needs to compare 2 timestamps.

> I've looked into this topic and found statements that one shouldn't
> put too many rows into the index[1] or it will be ignored entirely.
> But that doesn't seem to be the case for me according to the plan.
> OTOH, my index really is just on the column containing the timestamp,
> with no function reducing things to dates or similar to cut down the
> number of rows.
>
>> CREATE INDEX idx_clt_rec_captured_at ON clt_rec USING btree ( captured_at );

Try this:
CREATE INDEX idx_clt_rec_captured_at ON clt_rec USING btree ( captured_at, id );
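
Then verify that the planner actually picks it up, with something like (the time range here is just an example):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
  FROM clt_rec
 WHERE captured_at >= now() - interval '14 days'
   AND captured_at <= now();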

Alban Hertroys
--
There is always an exception to always.
