Re: Replacing Apache Solr with Postgre Full Text Search?

From: Mike Rylander <mrylander(at)gmail(dot)com>
To: J2eeInside J2eeInside <j2eeinside(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Replacing Apache Solr with Postgre Full Text Search?
Date: 2020-03-26 15:18:16
Message-ID: CAO8ar==10fY-Q+mP+krz+eMdqcQ1CtSdtHzPiC0NGj5WafWzUQ@mail.gmail.com
Lists: pgsql-general

On Thu, Mar 26, 2020 at 4:03 AM J2eeInside J2eeInside
<j2eeinside(at)gmail(dot)com> wrote:
>
> Hi Mike, and thanks for the valuable answer!
> In short, do you think PG Full Text Search can do the same as Apache Solr?
>

Can it? I mean, it does today. Whether it would for you depends on
your needs and how much effort you can afford to put into the stuff
that is /not/ the full text engine itself, like document normalizers
and search UIs.

There are trade-offs to be made when choosing any tool. Solr is
great, and so is Lucene (Solr's heart), and so is Elasticsearch. For
that matter, Zebra is awesome for full text indexing, too. Those all
make indexing a pile of documents easy. But none of those are great
as an authoritative data store, so for instance there will necessarily
be drift between your data and the Solr index, requiring a full
refresh. It's also hard to integrate non-document filtering
requirements like the ones I have in my use case. Both of those are
important to me, so PG's full text is my preference.
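
To illustrate the non-document filtering point, here is a minimal
sketch of a query that combines a full-text match with ordinary column
filters in a single WHERE clause. The table `documents`, the tsvector
column `body_tsv`, the filter column names, and the psycopg2-style
`%(name)s` placeholders are all assumptions for illustration, not
Evergreen's schema; `plainto_tsquery`, `ts_rank`, and `@@` are standard
PostgreSQL.

```python
def build_search_query(search_text, filters, limit=20):
    """Return (sql, params) for a ranked full-text search with column filters.

    Hypothetical schema: documents(id, body_tsv tsvector, ...filter columns).
    """
    # The full-text predicate always comes first.
    where = ["body_tsv @@ plainto_tsquery('english', %(q)s)"]
    params = {"q": search_text}

    # Each filter becomes a plain equality test on a non-document attribute.
    for column in sorted(filters):
        # Column names cannot be bound as parameters, so accept only
        # simple identifiers to keep the generated SQL safe.
        if not column.isidentifier():
            raise ValueError(f"invalid column name: {column!r}")
        key = f"f_{column}"
        where.append(f"{column} = %({key})s")
        params[key] = filters[column]

    sql = (
        "SELECT id, ts_rank(body_tsv, plainto_tsquery('english', %(q)s)) AS rank "
        "FROM documents "
        "WHERE " + " AND ".join(where) + " "
        f"ORDER BY rank DESC LIMIT {int(limit)}"
    )
    return sql, params
```

Because the text match and the attribute filters live in the same
statement, the planner can weigh them together, and there is no second
index to keep in sync with the authoritative data.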

Solr also didn't exist (publicly) in 2004 when we started building Evergreen. :)

> P.S. I need to index .pdf, .html and MS Word .doc/.docx files, are there any constraints in Full Text Search regarding those file types?
>

It can't handle those without some help -- it supports exactly text --
but you can extract the text using other tools.
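
For the .html case, a minimal sketch of that extraction step using only
the Python standard library might look like the following. The names
`TextExtractor` and `html_to_text` are made up for this example; PDF
and Word files need dedicated extraction tools, which are not shown
here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, ignoring script and style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces so adjacent elements do not fuse into one
        # word, then collapse all runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(markup):
    """Return the plain text of an HTML string, ready to hand to to_tsvector."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

The resulting string is what you would store in a text column (or pass
to to_tsvector) on the PG side.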

--
Mike Rylander
| Executive Director
| Equinox Open Library Initiative
| phone: 1-877-OPEN-ILS (673-6457)
| email: miker(at)equinoxinitiative(dot)org
| web: http://equinoxinitiative.org

>
> On Wed, Mar 25, 2020 at 3:36 PM Mike Rylander <mrylander(at)gmail(dot)com> wrote:
>>
>> On Wed, Mar 25, 2020 at 8:37 AM J2eeInside J2eeInside
>> <j2eeinside(at)gmail(dot)com> wrote:
>> >
>> > Hi all,
>> >
>> > I hope someone can help/suggest:
>> > I'm currently maintaining a project that uses Apache Solr/Lucene. To be honest, I would like to replace Solr with Postgre Full Text Search. However, there is a huge amount of documents involved - around 200GB. I'm wondering, can Postgre handle this efficiently?
>> > Does anyone have specific experience, and what should the infrastructure look like?
>> >
>> > P.S. To be clear, Solr works just fine, I just wanted to eliminate one component from the whole system (if Full Text Search can replace Solr at all)
>>
>> I'm one of the core developers (and the primary developer of the
>> search subsystem) for the Evergreen ILS [1] (integrated library system
>> -- think book library, not software library). We've been using PG's
>> full-text indexing infrastructure since day one, and I can say it is
>> definitely capable of handling pretty much anything you can throw at
>> it.
>>
>> Our indexing requirements are very complex and need to be very
>> configurable, and need to include a lot more than just "search and
>> rank a text column," so we've had to build a ton of infrastructure
>> around record (document) ingest, searching/filtering, linking, and
>> display. If your indexing and search requirements are stable,
>> specific, and well-understood it should be straightforward,
>> especially if you don't have to take into account non-document
>> attributes like physical location, availability, and arbitrary
>> real-time visibility rules like Evergreen does.
>>
>> As for scale, it's more about document count than total size. There
>> are Evergreen libraries with several million records to search, and
>> with proper hardware and tuning everything works well. Our main
>> performance issue has to do with all of the stuff outside the records
>> (documents) themselves that has to be taken into account during
>> search. The core full-text search part of our queries is extremely
>> performant, and has only gotten better over the years.
>>
>> [1] http://evergreen-ils.org
>>
>> HTH,
>> --
>> Mike Rylander
>> | Executive Director
>> | Equinox Open Library Initiative
>> | phone: 1-877-OPEN-ILS (673-6457)
>> | email: miker(at)equinoxinitiative(dot)org
>> | web: http://equinoxinitiative.org
