Re: Optimize single tuple fetch from nbtree index

From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Optimize single tuple fetch from nbtree index
Date: 2019-08-02 22:18:49
Message-ID: 1564784330839.3645@Optiver.com
Lists: pgsql-hackers

Hi Tom,

Thanks for your quick reply!

> Regardless of whether there's actually a LIMIT 1? That seems disastrous
> for every other case than the narrow one where the optimization wins.
> Because every other case is now going to have to touch the index page
> twice. That's more CPU and about double the contention --- if you could
> not measure any degradation from that, you're not measuring the right
> thing.

I thought the same at first. Note that I did measure a degradation of 2-3% in some cases, as mentioned, but I initially expected worse. Do you have any ideas about which cases would suffer the most? I expected the tight inner nested loops that I posted in my performance tests to have this index lookup as their bottleneck. I know it is the bottleneck for the LIMIT 1 query (because that improves by a factor of 2-3 with the patch). My theory is that the price paid for this optimization is highest for a LIMIT 3, because it touches the page twice and reads all items from it, while returning only three of them.
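
For concreteness, here is a minimal sketch of the two query shapes I have in mind; the table, columns and index (sensor_data, item_id, ts, price) are hypothetical and only illustrate the pattern:

    -- Hypothetical schema, only to illustrate the pattern.
    CREATE TABLE sensor_data (item_id int, ts timestamptz, price numeric);
    CREATE INDEX ON sensor_data (item_id, ts);

    -- LIMIT 1: the case the patch targets; _bt_first can return the first
    -- matching index tuple without reading the rest of the leaf page.
    SELECT * FROM sensor_data
    WHERE item_id = 42 AND ts <= '2019-08-02'
    ORDER BY ts DESC
    LIMIT 1;

    -- LIMIT 3: the presumed worst case, since the leaf page is touched twice
    -- (once for the first tuple, once for the rest) while only three tuples
    -- are returned.
    SELECT * FROM sensor_data
    WHERE item_id = 42 AND ts <= '2019-08-02'
    ORDER BY ts DESC
    LIMIT 3;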

> In principle, you could pass down knowledge of whether there's a LIMIT,
> using the same mechanism used to enable top-N sorting. But it'd have to
> also go through the AM interface layer, so I'm not sure how messy this
> would be.

This was an idea I had as well, and I would be willing to implement it if the community considers it interesting enough. However, I didn't want to do this for the first version of the patch, as it would be quite a bit of extra work that would be wasted if the idea of the patch itself gets rejected. :-) I'd appreciate any pointers in the right direction - I can take a look at how top-N sorting pushes the LIMIT down. Given enough interest in the basic idea of this patch, I will implement it.

>> This calls _bt_first in nbtsearch.c, which will, if there are scankeys, descend the tree to a leaf page and read just the first (or possibly two) tuples. It won't touch the rest of the page yet. If indeed just one tuple was required, there won't be a call to _bt_next and we're done. If we do need more than one tuple, _bt_next will resume reading tuples from the index page at the point where we left off.

> How do you know how many index entries you have to fetch to get a tuple
> that's live/visible to the query?

Indeed we don't know that - that's why this initial patch does not make any assumptions about it and just assumes the good-weather scenario in which everything is visible. I'm not sure whether it's possible to give an estimate of this, or whether that would be useful. Currently, if it turns out that the tuple is not visible, there will simply be another call to _bt_next, which resumes reading the page as normal. I'm open to implementing any suggestions that may improve this.
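
As an illustration of why visibility matters here (hypothetical table names again), a recently deleted row keeps its index entry until VACUUM removes it, so a LIMIT 1 scan may have to look past it:

    -- The most recent row is deleted, but its index entry remains until VACUUM.
    DELETE FROM sensor_data
    WHERE item_id = 42 AND ts = '2019-08-02 22:00:00';

    -- This scan lands on the dead entry first, finds the heap tuple invisible,
    -- and must continue with _bt_next to reach the next (visible) entry.
    SELECT * FROM sensor_data
    WHERE item_id = 42 AND ts <= '2019-08-02 23:00:00'
    ORDER BY ts DESC
    LIMIT 1;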

>> - We need to take into account page splits, insertions and vacuums while we do not have the read-lock in between _bt_first and the first call to _bt_next. This made my patch quite a bit more complicated than my initial implementation.

> Meh. I think the odds that you got this 100% right are small, and the
> odds that it would be maintainable are smaller. There's too much that
> can happen if you're not holding any lock --- and there's a lot of active
> work on btree indexes, which could break whatever assumptions you might
> make today.

Agreed, which is also why I posted this initial version of the patch here already: to get input from the experts on this topic about which assumptions can be made now and in the future. If it turns out that an optimization like this is simply not feasible because of other constraints in the btree implementation, then we're done pretty quickly here. :-) For what it's worth, the patch at least passes make check consistently - I caught a lot of the edge cases related to page splits and insertions while running the regression tests, which exercise the modified bits of code quite often and in parallel. There may be plenty of edge cases left, however...

> I'm not unalterably opposed to doing something like this, but my sense
> is that the complexity and probable negative performance impact on other
> cases are not going to look like a good trade-off for optimizing the
> case at hand.

I do think it could be a big win if we could get something like this working. Queries with a LIMIT seem common enough to me to justify some extra optimization, especially if it could lead to 2-3x the TPS for these kinds of queries. However, it indeed needs to stay within reasonable complexity. If it turns out that optimizing this requires adding a lot of extra complexity, it may not be worth it.

> BTW, you haven't even really made the case that optimizing a query that
> behaves this way is the right thing to be doing ... maybe some other
> plan shape that isn't a nestloop around a LIMIT query would be a better
> solution.

It is unfortunately pretty difficult to come up with any faster plan than this. We have a large database with many tables of timeseries data, and when the tables get large, there is an efficient multi-column index, and you want to do these kinds of time-based or item-based lookups, the nested loop is generally the fastest option. I can elaborate more on this, but I'm not sure this thread is the best place for that.
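
To give a concrete picture of the plan shape under discussion (again with hypothetical table names), the typical query is a nested loop that, for each item, fetches the latest row through the multi-column index with an inner LIMIT 1:

    -- Hypothetical "latest value per item" lookup: the outer relation drives a
    -- nested loop, and each inner iteration is an index scan bounded by LIMIT 1,
    -- which is exactly the access pattern the patch tries to speed up.
    SELECT i.item_id, d.ts, d.price
    FROM items i
    CROSS JOIN LATERAL (
        SELECT ts, price
        FROM sensor_data s
        WHERE s.item_id = i.item_id AND s.ts <= '2019-08-02'
        ORDER BY s.ts DESC
        LIMIT 1
    ) d;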

I appreciate you taking the time to take a look at this. I'd be happy to look more into any suggestions you come up with. Working on this has taught me a lot about the internals of Postgres already - I find it really interesting!

-Floris
