Re: Full Text Search ideas

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Howard Rogers <hjr(at)diznix(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Full Text Search ideas
Date: 2010-07-20 11:00:22
Message-ID: Pine.LNX.4.64.1007201458380.32129@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

It's doable. but requires a lot of work. We need support for this.

Oleg
On Sun, 18 Jul 2010, Howard Rogers wrote:

> I asked recently about a performance problem I'd been having with some
> full text queries, and got really useful help that pointed me to the
> root issues. Currently, I'm trying to see if our document search
> (running on Oracle Text) can be migrated to PostgreSQL, and the reason
> I asked that earlier question points to a fundamental design issue
> we'll have with PostgreSQL that doesn't affect us in Oracle (not, I
> hasten to add, that that means Oracle is better/right-er/whatever.
> It's just different -but the difference will cause us a problem).
>
> Consider the following example (which is just one of 40-odd I could
> have picked).
>
> Some of our documents are in panoramic format, for example. But not
> many (say, 30,000 out of 10,000,000). We have a flag for 'panoramic',
> called 'sb12'. It's either 'y' or 'n' for any document. So a search
> for 'sb12n' (find me all documents which are not panoramic) is
> logically the same as a search for 'not sb12y'. However, 95% or more
> of documents will be an sb12n, because hardly any documents are
> panoramic in the first place. So. although the numeric outcome of
> 'sb12n' and 'not sb12y' will always be the same, you would have to
> check the entire table to find which ones are 'sb12n' (because most
> documents are marked that way), whereas you'd only have to check the
> 5% of records to find 'sb12y', because so few are marked that way.
>
> But in Oracle Text, this doesn't seem to happen:
>
> SQL> select count(*) from search_digital_rm where
> contains(textsearch,'bat and sb12n')>0;
>
>  COUNT(*)
> ----------
>      3040
>
> Elapsed: 00:00:00.10
>
> SQL> select count(*) from search_digital_rm where
> contains(textsearch,'bat not sb12y')>0;
>
>  COUNT(*)
> ----------
>      3040
>
> Elapsed: 00:00:00.06
>
> In both cases, the same number of records are returned. But, within a
> margin of error, the time taken to do each test is about the same.
> Even though the first test must be matching 'sb12n' for many millions
> of records, it's taking not much longer than the search for 'sb12y',
> which can only match about 90,000. It would seem (I can't tell from
> the explain plan itself) as though what's happened is that the set of
> 'bat' records has been fetched first (about 8000 in all). For so few
> records, whether you're looking for sb12y or sb12n then becomes mostly
> irrelevant for timing purposes, and hence the duration equivalence of
> both queries.
>
> This is not what happens in PostgreSQL, however (as I now know, thanks
> to the help on my question from a couple of days ago):
>
> ims=# select count(*) from search_rm
> where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & sb12n');
> count
> -------
>  3849
> (1 row)
>
> Time: 408.962 ms
>
> ims=# select count(*) from search_rm
> where to_tsvector('english', textsearch) @@ to_tsquery('english','bat
> & !sb12y');
> count
> -------
>  3849
> (1 row)
>
> Time: 11.533 ms
>
> Now, one test takes about 40 times longer than the other, though the
> one taking just 11ms is as fast as Oracle can manage (impressive,
> considering I've done absolutely nothing to tune this PostgreSQL
> testbed as yet!). Logically equivalent the two tests may be, but
> hunting through lots of sb12n records and working out which are
> related to bats is apparently a lot slower than finding things the
> other way around, it would seem.
>
> I'm wondering firstly if there's any way I can configure PostgreSQL
> FTS so that it produces the sort of results we've gotten used to from
> Oracle, i.e., where search speeds do not go up wildly when a 'search
> term' is applied that happens to be used by the vast majority of
> document records. (For example, we currently allows searches for file
> types, where 80% of documents would be "word documents", another 19%
> would be PDFs and the remaining 1% of documents could be pretty much
> anything else! We can't have people searching for "definitely want
> only Word documents" if that means matching 8 million records and
> search speeds shoot to the stratosphere as a result).
>
> Secondly, I'm open to any suggestions as to how you would organise
> things or re-write the SQL so that the "attribute filter" is only
> applied to the small subset of records which match the relevant "real
> word" search term, if that's what's needed here. In other words, is my
> best bet in the earlier examples to fetch *all* "bat" records, and
> then nest that query within an outer query that adds a test for a
> separate attribute column? Or is something else called for here?
>
> Would appreciate any thoughts on the subject!
>
> Regards
> HJR
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru)
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Rafael Comino Mateos 2010-07-20 11:13:40 Invitación a conectarnos en LinkedIn
Previous Message Prometheus Prometheus 2010-07-20 10:27:00 what do i need to know about array index?