Re: GIN improvements part2: fast scan

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN improvements part2: fast scan
Date: 2014-02-03 15:36:29
Message-ID: CAPpHfds6xKm8CtfhG8zhFtD7WXTNW3RZCgdWiFfiwaQ0qcobLw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 30, 2014 at 8:38 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:

> On 01/30/2014 01:53 AM, Tomas Vondra wrote:
>
>> (3) A file with explain plans for 4 queries suffering ~2x slowdown,
>> and explain plans with 9.4 master and Heikki's patches is available
>> here:
>>
>> http://www.fuzzy.cz/tmp/gin/queries.txt
>>
>> All the queries have 6 common words, and the explain plans look
>> just fine to me - exactly like the plans for other queries.
>>
>> Two things caught my eye. First, some of these queries actually
>> have words repeated - either exactly, like "database & database", or
>> in negated form, like "!anything & anything". Second, while
>> generating the queries, I use "dumb" frequency, where only exact
>> matches count, i.e. "write != written" etc. But the actual number
>> of hits may be much higher - for example, "write" matches exactly
>> just 5% of documents, but using @@ it matches more than 20%.
>>
>> I don't know if that's the actual cause though.
>>
>
> I tried these queries with the data set you posted here:
> http://www.postgresql.org/message-id/52E4141E.60609@fuzzy.cz. The reason
> for the slowdown is the extra consistent calls it causes. That's expected -
> the patch certainly does call consistent in situations where it didn't
> before, and if the "pre-consistent" checks are not able to eliminate many
> tuples, you lose.
>
> So, what can we do to mitigate that? Some ideas:
>
> 1. Implement the catalog changes from Alexander's patch. That ought to
> remove the overhead, as you only need to call the consistent function once,
> not "both ways". OTOH, currently we only call the consistent function if
> there is only one unknown column. If, with the catalog changes, we always
> call the consistent function even when there is more than one unknown
> column, we might end up calling it even more often.
>
> 2. Cache the result of the consistent function.
>
> 3. Make the consistent function cheaper. (How? Magic?)
>
> 4. Use some kind of a heuristic, and stop doing the pre-consistent checks
> if they're not effective. Not sure what the heuristic would look like.
>

I'm fiddling around with the following idea for such a heuristic:
1) Sort the entries in ascending order by their estimated number of TIDs.
Let n be the number of entries.
2) Sequentially call the ternary consistent function with the first m
entries set to GIN_FALSE and the remaining n - m set to GIN_MAYBE,
increasing m until it returns GIN_FALSE.
3) Decide whether to use fast-scan by comparing the number of TIDs in the
first m entries with the number of TIDs in the remaining n - m entries.
Something like

    number_of_TIDs_in_first_m_entries * k < number_of_TIDs_in_rest_(n-m)_entries

where k is some quotient.
For sure, it's a rough method, but it should be fast. Without a native
implementation of the ternary consistent function the algorithm will be
slightly different: only values of m close to n can be checked, since
emulating GIN_MAYBE through the boolean consistent function requires
evaluating it for every combination of the unknown entries.
I'm going to implement this heuristic against the last version of your
patch.
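In C-like pseudocode, steps 2 and 3 would look roughly like this. All type
and function names below are illustrative only, not the actual GIN
structures, and a native ternary consistent callback is assumed:

#include <stdbool.h>

/* Ternary logic values, as a native ternary consistent API would expose */
typedef enum { TERN_FALSE, TERN_TRUE, TERN_MAYBE } TernValue;

/* Per-entry state; only the TID-count estimate matters for the heuristic */
typedef struct
{
    double      estTids;        /* estimated number of TIDs in this entry */
} Entry;

/* Assumed ternary consistent callback: keys[i] is TERN_FALSE or TERN_MAYBE */
typedef TernValue (*TernConsistentFn) (const TernValue *keys, int nkeys);

/*
 * Decide whether fast-scan is likely to pay off.  'entries' must already
 * be sorted ascending by estTids (step 1); 'k' is the tunable quotient.
 */
static bool
decide_fast_scan(const Entry *entries, int n,
                 TernConsistentFn consistent, double k)
{
    TernValue   keys[64];       /* assume n <= 64 for this sketch */
    double      firstTids = 0.0;
    double      restTids = 0.0;
    int         m;
    int         i;

    /*
     * Step 2: find the smallest m such that a tuple provably cannot match
     * when the m cheapest entries are all absent (FALSE) and the remaining
     * n - m entries are still unknown (MAYBE).
     */
    for (m = 1; m <= n; m++)
    {
        for (i = 0; i < n; i++)
            keys[i] = (i < m) ? TERN_FALSE : TERN_MAYBE;
        if (consistent(keys, n) == TERN_FALSE)
            break;
    }
    if (m > n)
        return false;       /* no prefix forces FALSE, so nothing to skip */

    /* Step 3: fast-scan wins if the prefix is much cheaper than the rest */
    for (i = 0; i < n; i++)
    {
        if (i < m)
            firstTids += entries[i].estTids;
        else
            restTids += entries[i].estTids;
    }
    return firstTids * k < restTids;
}

The loop finds the smallest prefix of cheap entries whose absence alone
rules a tuple out; fast-scan is then worthwhile only if those entries are
much smaller, in total, than the remaining ones.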

------
With best regards,
Alexander Korotkov.
