From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Google Summer of Code 2008
Date: 2008-03-08 19:29:36
Message-ID: Pine.LNX.4.64.0803082219280.10010@sn.sai.msu.ru
Lists: pgsql-hackers
On Sat, 8 Mar 2008, Jan Urbański wrote:
> Oleg Bartunov wrote:
>> Jan,
>>
>> the problem is known and frequently requested. From your proposal it's
>> not clear what the idea is?
>>> Tom Lane wrote:
>>>> Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
>>>> writes:
>>>>> 2. Implement better selectivity estimates for FTS.
>
> OK, after reading through the some of the code the idea is to write a custom
> typanalyze function for tsvector columns. It could look inside the tsvectors,
> compute the most commonly appearing lexemes and store that information in
> pg_statistics. Then there should be a custom selectivity function for @@ and
> friends, that would look at the lexemes in pg_statistics, see if the tsquery
> it got matches some/any of them and return a result based on that.
Such a function already exists: it's ts_stat(). The problem with ts_stat()
is its performance, since it sequentially scans ALL tsvectors. It's possible
to write a special function for the tsvector data type, which would be used
by ANALYZE, but I'm not sure sampling is a good approach here.
One way we could improve the performance of gathering stats with ts_stat()
is to process only new documents. That may not be as fast as it looks if
there are a lot of updates, so this needs more thought.
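For reference, this is the kind of full scan ts_stat() does today; the table
and column names below are hypothetical, only ts_stat() itself is real:

```sql
-- ts_stat() executes the inner query and walks every tsvector it returns,
-- counting, for each lexeme, the number of documents (ndoc) and total
-- occurrences (nentry). This sequential scan of ALL tsvectors is exactly
-- the performance problem discussed above.
SELECT word, ndoc, nentry
FROM ts_stat('SELECT body_tsv FROM documents')
ORDER BY ndoc DESC, nentry DESC
LIMIT 50;
```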
>
> I have a feeling that in many cases identifying the top 50 to 300 lexemes
> would be enough to talk about text search selectivity with a degree of
> confidence. At least we wouldn't give overly low estimates for queries
> looking for very popular words, which I believe is worse than giving an
> overly high estimate for an obscure query (am I wrong here?).
Unfortunately, selectivity estimation for a whole query is much more
difficult than just estimating the frequencies of individual words.
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru)
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83