Re: [PATCH] Fix for slow GIN index queries when "gin_fuzzy_search_limit" setting is relatively small for large tables

From: Adé <ade(dot)hey(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: [PATCH] Fix for slow GIN index queries when "gin_fuzzy_search_limit" setting is relatively small for large tables
Date: 2020-03-12 04:22:54
Message-ID: CAEknJCfV1aNS5MgXKdRGpmaT9a0TGWifPtQvZCZxAa5pqtf2ew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, Tom. Thanks for taking a look.

> It seems like what you're actually trying
> to accomplish is to ensure that entryLoadMoreItems's "stepright" path
> is taken, instead of re-descending the index from the root.

What I was primarily trying to do is make sure that when entryLoadMoreItems
is called, it loads new items that it didn’t load previously, which would
occur in special cases. But the solution to that goal does result in the
"stepright" path being used. I explain the main goal in a segment of my bug
report this way, though it's a bit longwinded (from
https://www.postgresql.org/message-id/16220-1a0a4f0cb67cafdc@postgresql.org
):

Since the program doesn't load all items into memory at once, it calls the
> "entryLoadMoreItems" function when it needs to get another page of items to
> iterate through. The "entryLoadMoreItems" function calls are passed an
> "advancePast" variable as an argument. This variable decides what leaf page
> in the items/posting tree should more items be retrieved from. Usually when
> iterating through all possible results, execution will exit the do-while
> loop responsible for iteration in order to perform some important actions
> (including the updating of the "advancePast" variable) before returning
> into
> the loop again to iterate over more items. However, when "dropItem" returns
> true in succession a great many times, the do-while loop can not be exited
> for updating the "advancePast" variable until a non-drop finally occurs.
> When this "advancePast" variable is not updated it leads to calls to
> "entryLoadMoreItems" repeatedly returning the same items when stuck in the
> do-while loop by a high succession of dropped items (because "advancePast"
> is never updated to a value after items already iterated through).

> Along with the issue of returning the same items, there's the issue of how
> the "entryLoadMoreItems" function traverses the posting tree from the root
> each time it's called while stuck in the do-while loop. This especially is
> the cause for the bad performance seen for low "gin_fuzzy_search_limit"
> values. In some cases, this situation is made even worse when "advancePast"
> is set to a value that leads to loading a page of items that has relatively
> few items actually past "advancePast", and so it must almost immediately
> call "entryLoadMoreItems" again. But because "advancePast" never gets
> updated, this results in a higher than usual succession of
> "entryLoadMoreItems" function calls (the program loads the same page,
> iterates over the same relatively few items until it goes and loads the
> same
> page again), with each call requiring traversal from the root of the
> posting
> tree down to the same leaf page as before.

> My patch makes it so that when stuck in the do-while loop after many
> successive "dropItems" returning true, the program instead now loads actual
> new items by passing the last item dropped into the "entryLoadMoreItems"
> function instead of the "advancePast" variable that can't be appropriately
> updated while stuck in the do-while loop. This means "entryLoadMoreItems"
> will instead load items ordered right after the last dropped item. This
> last
> item dropped is also the current item ("curItem") and so the
> "entryLoadMoreItems" can directly obtain the next page of items by making a
> step right from the current page, instead of traversing from the root of
> the
> posting tree, which is the most important fix for performance.

In regards to this:

While we're here, what do you think about the comment in the other
> code branch just above:
> /* XXX: shouldn't we apply the fuzzy search limit here? */
> I'm personally inclined to suspect that the fuzzy-search business has
> got lots of bugs, which haven't been noticed because (a) it's so squishily
> defined that one can hardly tell whether a given result is buggy or not,
> and (b) nearly nobody uses it anyway (possibly not unrelated to (a)).
> As somebody who evidently is using it, you're probably more motivated
> to track down bugs than the rest of us.

I think the comment is correct. It should be applied if you are to stay
consistent. Like the comment above that comment says, that code branch is
for the two cases of either (1) reaching the last page of a posting tree or
(2) when an "entry"/keyword has so few results that the item pointers fit
in just one page containing a posting list. If there is a chance of a
dropped item in the other pages of the posting tree, there should be a
chance of dropped items in the last page too for consistency sake at least.
And there should also be a chance of dropped items when iterating a single
posting list of entry with relatively few results. Placing "||
(entry->reduceResult == true && dropItem(entry))" at the end of the while
condition should be all that's needed to apply the fuzzy search limit there.

And I agree that probably usage of the fuzzy search feature is extremely
rare and the way I'm using it probably even more rare. So thank you for
taking a look at it. It's a really great feature for me though and I'm glad
the creator placed it in.

Regards,
Ade

On Tue, Mar 10, 2020 at 7:16 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> =?UTF-8?B?QWTDqQ==?= <ade(dot)hey(at)gmail(dot)com> writes:
> > Like the title says, using "gin_fuzzy_search_limit" degrades speed when
> it
> > has a relatively low setting.
> > ...
> > Attached is SQL to test and observe this issue and also attached is a
> patch
> > I want to eventually submit to a commitfest.
>
> I took a brief look at this. It seems like what you're actually trying
> to accomplish is to ensure that entryLoadMoreItems's "stepright" path
> is taken, instead of re-descending the index from the root. Okay,
> I can see why that'd be a big win, but why are you tying it to the
> dropItem behavior? It should apply any time we're iterating this loop
> more than once. IOW, it seems to me like the logic about when to step
> right is just kinda broken, and this is a band-aid rather than a full fix.
> The performance hit is worse for fuzzy-search mode because it will
> iterate the loop more (relative to the amount of work done elsewhere),
> but there's still a potential for wasted work.
>
> Actually, a look at the code coverage report shows that the
> not-step-right path is never taken at all in our standard regression
> tests. Maybe that just says bad things about the tests' coverage, but
> now I'm wondering whether we could just flush that code path altogether,
> and assume that we should always step right at this point.
>
> [ cc'ing heikki and alexander, who seem to have originated that code
> at 626a120656a75bf4fe64b1d0d83c23cb38d3771a. The commit message says
> it saves a lot of I/O, but I'm wondering if this report disproves that.
> In any case the lack of test coverage is pretty damning. ]
>
> While we're here, what do you think about the comment in the other
> code branch just above:
>
> /* XXX: shouldn't we apply the fuzzy search limit here? */
>
> I'm personally inclined to suspect that the fuzzy-search business has
> got lots of bugs, which haven't been noticed because (a) it's so squishily
> defined that one can hardly tell whether a given result is buggy or not,
> and (b) nearly nobody uses it anyway (possibly not unrelated to (a)).
> As somebody who evidently is using it, you're probably more motivated
> to track down bugs than the rest of us.
>
> regards, tom lane
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-03-12 04:33:21 Re: Refactor compile-time assertion checks for C/C++
Previous Message Ashutosh Bapat 2020-03-12 04:17:36 Re: BEFORE ROW triggers for partitioned tables