Re: [pg_trgm] Making similarity(?, ?) < ? use an index

From: Greg Navis <contact(at)gregnavis(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: [pg_trgm] Making similarity(?, ?) < ? use an index
Date: 2016-06-08 09:16:17
Message-ID: CAA6WWt8xgOc2PeJRtmHcGYnQuqOr1j561p7hakefwZGT0whwJQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Thanks for the replies.

On Sat, Jun 4, 2016 at 8:48 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Sat, Jun 4, 2016 at 2:50 AM, Greg Navis <contact(at)gregnavis(dot)com> wrote:
> > Thanks for your replies.
> >
> > Sorry for confusion. Instead of `similarity(lhs, rhs) >= show_limit()`,
> > which of course is completely equivalent to `lhs % rhs`, I wanted to
> write
> > `similarity(lhs, rhs) >= my_custom_threshold`. It seems that the approach
> > with ternary operators is quite a bit of work. I might have a simpler
> idea:
>
> The problem is that you would have to build your index on the
> expression (similarity(lhs,rhs)), and the then one of the lhs or rhs
> would have to be a column, and the other would have to be a constant
> string at the time you define the index, which is the same constant
> string as specified when you want to use the index. This could work
> as long as there are only a few strings you ever care about checking
> similarity to, and you would build an index for each one. These would
> be ordinary btree indexes, not gin indexes. I've done this before,
> only using substring rather than similarity.

>
> > pg_trgm also provides `<->` but it seems this operator doesn't use
> indexes
> > either. It seems the shortest path to per-query thresholds, without
> > compromising the design, is making this operator use the index.
>
> To be efficient, the index has to know where to start and stop its
> scan. Such a thing doesn't exists for "a<->b", there is no stopping
> point. It would have to be "a <-> b < c", at which point you are back
> to ternary again.
>
Sorry, that was sloppy thinking on my end. I was fixated on looking for
something simple and saw things that aren't there. ;-) You're absolutely
right that we're back to ternary with `a <-> b < c`.

>
> > Please help
> > me understand whether my reasoning is correct. If it is, I'd appreciate a
> > high-level overview of what needs to be done. I can block a few hours to
> > work on this in the upcoming weeks.
>
> I think that you are proposing rewriting the indexing architecture
> from scratch. Doing that in a way that is robust and tested and
> backward compatible would probably take years, not hours, for a
> single individual who is not already extensively experienced in the
> internals of the code. And, it is not really clear even what you want
> to rewrite it into.
>
No rewrites, please. ;-) I was just confused. I'm looking for a simple,
incremental contribution, not a rewrite.

>
> Tom's proposal is to do it in a way that works with the existing
> architecture, rather than trying to re-write that architecture.
> (Which could probably not be done in a few hours, either, especially
> not once you write documentation and upgrade scripts and all the other
> stuff that goes into production-quality code.)
>
Thanks for the info. I blocked a few hours for open source work every week
so whether it takes 10 or 40 hours shouldn't be a problem.

>
> I don't know if this would even be appropriate as an addition to
> pg_trgm. We might want to fork that code instead. That would be a
> shame, because the underlying c code would be the fundamentally the
> same, but the alternative would be to force people who like % and
> set_limit() to carry around the baggage of new operators and types
> they have no interest in using, and vice versa. True, we did just add
> several new functions and operators to pg_trgm that many people will
> have no interest in, so maybe that is not a big deal.
>
> Cheers,
>
> Jeff
>

Would this be a better plan then:

1. Add support for trigram operators.
2. Implement `issimilar(lhs, rhs, threshold)`.
3. Add `issimilar` to the trigram operator classes.

PS Should we move this discussion to pgsql-hackers?
--
Greg Navis
I help tech companies to scale Heroku-hosted Rails apps.
Free, biweekly scalability newsletter for SaaS CEOs
<http://www.gregnavis.com/newsletter/>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Artur Zakirov 2016-06-08 10:10:37 Re: [pg_trgm] Making similarity(?, ?) < ? use an index
Previous Message Tatsuo Ishii 2016-06-08 07:24:47 Re: High availability and load balancing ...