Optimize single tuple fetch from nbtree index

From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Optimize single tuple fetch from nbtree index
Date: 2019-08-02 15:23:23
Message-ID: 1564759403216.11445@Optiver.com
Lists: pgsql-hackers

Hi hackers,

While I was reviewing some code in another patch, I stumbled upon a possible optimization in the btree index code in nbtsearch.c for queries using 'LIMIT 1'. I have written a small patch that implements this optimization, but I'd like some feedback on the design of the patch, whether it is correct at all to use this optimization, and whether the performance tradeoffs are deemed worth it by the community.

Basically, an example of the case I'd like to optimize is the following. Given a table 'tbl' with an index on columns (k,ts DESC):

SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC LIMIT 1;

And, even more importantly, when this query gets called in an inner loop like:

SELECT * FROM generate_series(:start_ts, :end_ts, :interval) ts -- perhaps thousands of iterations; could also be a loop over values of 'k' rather than timestamps, this is just an example
CROSS JOIN LATERAL (
    SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC LIMIT 1
) _;
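For reference, this is roughly the kind of schema the examples assume. The actual test table is created by the attached init_test.sql; the column names and types below are just my illustration, not the attached script:

CREATE TABLE tbl (
    k  integer      NOT NULL,
    ts timestamptz  NOT NULL,
    v  double precision   -- payload column, name assumed for illustration
);
CREATE INDEX ON tbl (k, ts DESC);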

With time-series data, this case often arises: you have a certain natural key for which you store updates as they occur, and getting the state of that key at a specific time boils down to the query above. This is almost always the fastest way to get this information, since the index scan with LIMIT 1 is very fast already. However, there seems to be a possibility to make this even faster (up to nearly 3x faster in test cases that use this nested loop of index scans).

Every time the index scan is done, all tuples from the leaf page are read in nbtsearch.c:_bt_readpage. The idea of this patch is to make an exception to this, but *only* for the first time amgettuple gets called. That call goes through _bt_first in nbtsearch.c, which will, if there are scankeys, descend the tree to a leaf page and read just the first tuple (or possibly two). It won't touch the rest of the page yet. If indeed just one tuple was required, there won't be a call to _bt_next and we're done. If we do need more than one tuple, _bt_next will resume reading tuples from the index page at the point where we left off.
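To make the two paths concrete, here is a sketch in terms of the queries above (not taken from the patch; the comments merely restate the behavior described in this paragraph):

-- LIMIT 1: _bt_first descends to the leaf page and reads only the first
-- (or possibly second) tuple; _bt_next is never called, so the rest of
-- the page is never read.
SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC LIMIT 1;

-- LIMIT 3: _bt_next is called after the first tuple and has to lock the
-- same leaf page a second time to resume reading where _bt_first left off.
SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC LIMIT 3;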

There are a few caveats:

- Possible performance decrease for queries that need a small number of tuples (but more than one), because they now need to lock the same page twice. This can happen in several cases, for example: LIMIT 3; LIMIT 1 but the first tuple returned does not match other scan conditions; LIMIT 1 but the tuple returned is not visible; no LIMIT at all but only a few matching rows.

- We need to take into account page splits, insertions and vacuums while we do not have the read-lock in between _bt_first and the first call to _bt_next. This made my patch quite a bit more complicated than my initial implementation.

I did performance tests for some best-case and worst-case scenarios. TPS results were stable and reproducible across re-runs on my otherwise idle server. Attached are the full results and how to reproduce them. I picked test cases that show the best as well as the worst performance compared to master. Summary: the greatest improvement can be seen for the cases with the subquery in a nested loop. With 100 loop iterations, performance is roughly two times better; with 10000 iterations, roughly three times better. For most test cases that don't use LIMIT 1, I couldn't find a noticeable difference, except for the nested loop with a LIMIT 3 (or similarly, a nested loop without any LIMIT clause that returns just three tuples). This is also theoretically the worst case, because it has to lock the page again and then read it, just for one tuple. Here I saw TPS decrease by 2-3% in a few cases (details in the attached file), due to having to lock/unlock the same page in both _bt_first and _bt_next.

A few concrete questions to the community:

- Does the community also see this as a useful optimization?

- Is the way it is currently implemented safe? I struggled quite a bit to get everything working with respect to page splits and insertions. In particular, I don't know why, in my patch, _bt_find_offset_for_tid needs to consider searching for items with an offset *before* the passed offset. As far as my understanding goes, this could only happen when the index gets vacuumed in the meantime. However, we hold a pin on the buffer the whole time (we even assert this), so vacuum should not have been possible. Still, this code gets triggered sometimes, so it seems to be necessary. Perhaps someone in the community who's more of an expert on this can shed some light on it.

- What are considered acceptable performance tradeoffs in this case? Is a performance degradation in any part generally not acceptable at all?

I'd also welcome any feedback on the process - this is my first patch and while I tried to follow the guidelines, I may have missed something along the way.

Attachments:

- out_limit.txt: pgbench results for patched branch

- out_master.txt: pgbench results for master branch (can be diffed with out_limit.txt to efficiently see the difference)

- init_test.sql: creates a simple table for the test cases and fills it with data

- test_fwd.sql: the nested loop example, with parameters :nlimit and :nitems specifying how many rows each inner loop is limited to and how many iterations of the loop are done (a sketch follows after this list). We return a sum() over a column to make sure transferring result data over the socket is not the bottleneck - this is to show the worst-case behavior of this patch. When selecting the full row instead of the sum, the performance difference is negligible for the 'bad' cases, while still providing great (up to 2.5x) improvements for the 'good' cases. This can be tested by changing the sum() to select a column per row instead.

- test_fwd_eq.sql: nested loop with simple unique equality select

- test_fwd_single.sql: single query with LIMIT without nested loop
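For readers who don't want to open the attachments, this is roughly the shape of the nested-loop script in test_fwd.sql. The exact column and parameter names in the attached file may differ; the column v summed over here is my own placeholder:

SELECT sum(v)  -- sum over a column so result transfer over the socket is not the bottleneck
FROM generate_series(:start_ts, :end_ts, :interval) ts_param  -- :nitems iterations
CROSS JOIN LATERAL (
    SELECT v FROM tbl
    WHERE k=:val AND ts<=ts_param
    ORDER BY k, ts DESC
    LIMIT :nlimit
) _;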

-Floris

Attachment Content-Type Size
0001-Optimize-single-tuple-fetch-from-nbtree-index-v01.patch application/octet-stream 22.8 KB
init_test.sql application/octet-stream 190 bytes
out_limit.txt text/plain 9.5 KB
out_master.txt text/plain 9.4 KB
test_fwd.sql application/octet-stream 150 bytes
test_fwd_eq.sql application/octet-stream 104 bytes
test_fwd_single.sql application/octet-stream 73 bytes
