Re: GIN improvements part2: fast scan

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN improvements part2: fast scan
Date: 2014-03-12 16:02:58
Message-ID: 53208532.8090005@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 02/26/2014 11:25 PM, Alexander Korotkov wrote:
> On Thu, Feb 27, 2014 at 1:07 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:
>
>> On Thu, Feb 20, 2014 at 1:48 PM, Heikki Linnakangas <
>> hlinnakangas(at)vmware(dot)com> wrote:
>>
>>> On 02/09/2014 12:11 PM, Alexander Korotkov wrote:
>>>
>>>> I've rebased catalog changes with last master. Patch is attached. I've
>>>> rerun my test suite with both last master ('committed') and attached
>>>> patch ('ternary-consistent').
>>>>
>>>
>>> Thanks!
>>>
>>>
>>> method | sum
>>>> ------------------------+------------------
>>>> committed | 143491.715000001
>>>> fast-scan-11 | 126916.111999999
>>>> fast-scan-light | 137321.211
>>>> fast-scan-light-heikki | 138168.028000001
>>>> master | 446976.288
>>>> ternary-consistent | 125923.514
>>>>
>>>> I explain regression in last master by change of MAX_MAYBE_ENTRIES from 8
>>>> to 4.
>>>>
>>>
>>> Yeah, probably. I set MAX_MAYBE_ENTRIES to 8 in initial versions to make
>>> sure we get similar behavior in Tomas' tests that used 6 search terms. But
>>> I always felt that it was too large for real queries, once we have the
>>> catalog changes, that's why I lowered to 4 when committing. If an opclass
>>> benefits greatly from fast scan, it should provide the ternary consistent
>>> function, and not rely on the shim implementation.
>>>
>>>
>>> I'm not sure about decision to reserve separate procedure number for
>>>> ternary consistent. Probably, it would be better to add ginConfig method.
>>>> It would be useful for post 9.4 improvements.
>>>>
>>>
>>> Hmm, it might be useful for an opclass to provide both, a boolean and
>>> ternary consistent function, if the boolean version is significantly more
>>> efficient when all the arguments are TRUE/FALSE. OTOH, you could also do a
>>> quick check through the array to see if there are any MAYBE arguments,
>>> within the consistent function. But I'm inclined to keep the possibility to
>>> provide both versions. As long as we support the boolean version at all,
>>> there's not much difference in terms of the amount of code to support
>>> having them both for the same opclass.
>>>
>>> A ginConfig could be useful for many other things, but I don't think it's
>>> worth adding it now.
>>>
>>>
>>> What's the difference between returning GIN_MAYBE and GIN_TRUE+recheck?
>>> We discussed that earlier, but didn't reach any conclusion. That needs to
>>> be clarified in the docs. One possibility is to document that they're
>>> equivalent. Another is to forbid one of them. Yet another is to assign a
>>> different meaning to each.
>>>
>>> I've been thinking that it might be useful to define them so that a MAYBE
>>> result from the tri-consistent function means that it cannot decide if you
>>> have a match or not, because some of the inputs were MAYBE. And
>>> TRUE+recheck means that even if all the MAYBE inputs were passed as TRUE or
>>> FALSE, the result would be the same, TRUE+recheck. The practical difference
>>> would be that if the tri-consistent function returns TRUE+recheck, ginget.c
>>> wouldn't need to bother fetching the other entries, it could just return
>>> the entry with recheck=true immediately. While with MAYBE result, it would
>>> fetch the other entries and call tri-consistent again. ginget.c doesn't
>>> currently use the tri-consistent function that way - it always fetches all
>>> the entries for a potential match before calling tri-consistent, but it
>>> could. I had it do that in some of the patch versions, but Tomas' testing
>>> showed that it was a big loss on some queries, because the consistent
>>> function was called much more often. Still, something like that might be
>>> sensible in the future, so it might be good to distinguish those cases in
>>> the API now. Note that ginarrayproc is already using the return values like
>>> that: in GinContainedStrategy, it always returns TRUE+recheck regardless of
>>> the inputs, but in other cases it uses GIN_MAYBE.
>>
>>
>> Next revision of patch is attached.
>>
>> In this version opclass should provide at least one consistent function:
>> binary or ternary. It's expected to achieve best performance when opclass
>> provide both of them. However, tests shows opposite :( I've to recheck it
>> carefully.
>>
>
> However, it's not!
> This revision of patch completes my test case in 123330 ms. This is
> slightly faster than previous revision.

Great. Committed, I finally got around to it.

Some minor changes: I reworded the docs and also updated the table of
support functions in xindex.sgml. I refactored the query in
opr_sanity.sql, to make it easier to read, and to check more carefully
that the required support functions are present. I also added a runtime
check to avoid crashing if neither is present.

A few things we ought to still discuss:

* I just noticed that the dummy trueTriConsistentFn returns GIN_MAYBE,
rather than GIN_TRUE. The equivalent boolean version returns 'true'
without recheck. Is that a typo, or was there some reason for the
discrepancy?

* Come to think of it, I'm not too happy with the name GinLogicValue.
It's too vague. It ought to include "ternary" or "tri-value" or
something like that. How about renaming it to "GinTernaryValue" or just
"GinTernary"? Any better suggestion for the name?

* This patch added a triConsistent function for array and tsvector
opclasses. Were you planning to submit a patch to do that for the rest
of the opclasses, like pg_trgm? (it's getting awfully late for that...)

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2014-03-12 16:09:20 Re: COPY table FROM STDIN doesn't show count tag
Previous Message Tom Lane 2014-03-12 15:57:04 Re: COPY table FROM STDIN doesn't show count tag