Looking for an index that supports top-n searches by enforcing a max-n automatically

From: Morris de Oryx <morrisdeoryx(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Looking for an index that supports top-n searches by enforcing a max-n automatically
Date: 2024-04-05 00:27:28
Message-ID: CAKqncci5FC75AE9XLjwgT0=6uw40i=C80Zi5XzX4pXw4yoPwyw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Looking for an index to support top-n searches, were n has a fixed maximum

Recently, I've been looking at strategies to handle top-n queries in
Postgres. In my current cases, we've got definition tables, and very large
related tables. Here's a stripped-down example, the real tables are much
wider.

CREATE TABLE IF NOT EXISTS data.inventory (
id uuid NOT NULL DEFAULT NULL PRIMARY KEY,
inv_id uuid NOT NULL DEFAULT NULL
);

CREATE TABLE IF NOT EXISTS data.scan (
id uuid NOT NULL DEFAULT NULL PRIMARY KEY,
inv_id uuid NOT NULL DEFAULT NULL
scan_dts_utc timestamp NOT NULL DEFAULT NOW(); -- We run out
servers on UTC
);

Every item in inventory is scanned when it passes through various steps in
a clean-dispatch-arrive-use-clean sort of a work cycle. The ratio between
inventory and scan is 1:0-n, where n can be virtually any number. In
another table pair like this, the average is 1:1,000. In the inventory
example, it's roughly 0:200,000. The distribution of related row counts is
all over the map. The reasons behind these ratios sometimes map to valid
processes, and sometimes are a consequence of some low-quality data leaking
into the system. In the case of inventory, the results make sense. In our
case:

* The goal value for n is often 1, and not more than up to 25.

* Depending on the tables, the % of rows that are discarded because they're
past the 25th most recent record is 97% or more.

* Partial indexes do not work as well on huge tables as I hoped. The same
data copied via a STATEMENT trigger into a thin, subsetted table is much
faster. I think this has to do with the increase in random disk access
required for a table and/or index with more pages spread around on the disk.

* We can't filter in advance by *any* reasonable date range. Could get 25
scans on one item in an hour, or a year, or five years, or never.

We're finding that we need the top-n records more and more often, returned
quickly. This gets harder to do as the table(s) grow.

SELECT id, scan_dts_utc
FROM data.scan
WHERE inv_id = 'b7db5d06-8275-224d-a38a-ac263dc1c767' curve.
ORDER BY scan_dts_utc DESC
LIMIT 25; -- Full search product might be 0, 200K, or anything
in-between. Not on a bell curve.

A compound index works really well to optimize these kinds of searches:

CREATE INDEX scan_inv_id_scan_time_utc_dts_idx
ON ascendco.analytic_scan (inv_id, scan_time_utc_dts DESC);

What I'm wondering is if there is some index option, likely not with a
B-tree, that can *automatically* enforce a maximum-length list of top
values, based on a defined sort

CREATE INDEX scan_inv_id_scan_time_utc_dts_idx
ON ascendco.analytic_scan (inv_id, scan_time_utc_dts DESC) --
This defines the ordering
LIMIT 25; --
This sets the hard max for n

The goal is to have an automatically maintained list of the top values *in*
the index itself. In the right situations (like ours), this reduces the
index size by 20x or more. Smaller index, faster results. And, since the
index is on the source table, the row references are already there.
(Something I lose when maintaining this by hand in a side/shadow/top table.)

I've looked at a ton of plans, and Postgres *clearly* goes to a lot of
effort to recognize and optimize top-n searches already. That's
encouraging, as it suggests that the planner takes LIMIT into account.
(I've picked up already that maintaining the purity of the planner and
executor abstractions is a core value to the project.)

And, for sure, I can build and maintain my own custom, ordered list in
various ways. None of them seem like they can possibly rival the trimming
behavior being handled by an index.

I'm out over my skis here, but I'm intuiting that this might be a job for
one of the multi-value/inverted index types/frameworks. I tried some
experiments, but only got worse results.

Hope that reads as understandable...grateful for any suggestions.

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Corey Huinker 2024-04-05 01:30:32 Re: Statistics Import and Export
Previous Message Michael Paquier 2024-04-05 00:06:44 Re: Autogenerate some wait events code and documentation