Re: Optimizing search query with sorting by creation field

From: Vijaykumar Jain <vijaykumarjain(dot)github(at)gmail(dot)com>
To: Droid Tools <droidtools(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Optimizing search query with sorting by creation field
Date: 2021-05-07 12:12:50
Message-ID: CAM+6J94+HsZus0BigkrwGr9UFD87bYNNSK=ke4EYw+a5HARKLA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

What is your baseline expectation?
With what size of db table, what query should take how much time
How much server resources can be used?

If this seems to be a timeseries db,
Are the rows append only and random insertion order ?

You are create partitions based on time and sub partitions on some other
field that can be used as predictable to exclude scanning other partitions
and speed up retrieved rows.
Clustering of tables based on timestamp may help reduce sort time if order
by is always used.

Basically a lot of stuff for improvement I would list are used by
timescaledb which is optimized for time series based queries.

If you want or do not want to use timescaledb,
I guess it has enough content to help plan time series based data query and
retrieve data.

On Fri, May 7, 2021, 6:12 AM Droid Tools <droidtools(at)gmail(dot)com> wrote:

> Hi,
>
> I'm looking for tips on optimizing a search query where someone searches
> for content within a post and wants to sort the results by a timestamp. I
> have a table `posts` with a `created` field (timestamp) and a post_tsv
> column (TSVECTOR). Likewise I have a GIN Index on the `post_tsv` field and
> a separate index on the `created` field.
>
> My initial, naive, attempt was to simply to do something like:
>
> ```
> SELECT * FROM posts WHERE post_tsv @@ websearch_to_tsquery(?) ORDER BY
> created DESC
> ```
>
> However, I didn't realize in this case the `created` index would be
> ignored, which means if there was a large number of posts returned, this
> query would take several seconds to execute. Also the planner would do
> weird things even if the result set was small and still take several
> seconds to execute. Currently I've papered over the problem by issuing a
> subquery and sorting that instead.
>
> ```
> SELECT * FROM
> (SELECT * FROM posts WHERE post_tsv @@ websearch_to_tsquery(?) LIMIT
> 10,000) q
> ORDER BY created DESC
> ```
>
> In short I execute the search, limit that to 10,000 rows, and then order
> the 10,000 rows that were returned. This worked amazingly for queries that
> returned fewer than 10,000 rows as those queries went from taking several
> seconds to run down to a handful of milliseconds. The problem is for
> queries with more than 10,000 rows you essentially end up with random
> results. I'm still not using the created index, but sorting 10,000 rows in
> memory is relatively fast.
>
> I'm stuck where to go from here - what I would like, since I know I will
> only ever ORDER BY the created field is to build some index where the
> default ordering is by the created field. GIN, as I understand it, doesn't
> support indexing in this manner (using one of the columns as a sort field).
> Is there anything else I could try?
>
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Thomas Boussekey 2021-05-07 13:52:24 Re: [RPM/CentOS7] Need to disable repo_gpgcheck on pgdg-common when using RPM version 42.0-17.1
Previous Message Devrim Gündüz 2021-05-07 11:55:39 Re: [RPM/CentOS7] Need to disable repo_gpgcheck on pgdg-common when using RPM version 42.0-17.1