From: | Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl> |
---|---|
To: | Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Google Summer of Code 2008 |
Date: | 2008-03-08 18:50:02 |
Message-ID: | 47D2DFDA.5010302@students.mimuw.edu.pl |
Lists: | pgsql-hackers |
Oleg Bartunov wrote:
> Jan,
>
> the problem is known and often requested. From your proposal it's not
> clear what the idea is?
>> Tom Lane wrote:
>>> Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
>>> writes:
>>>> 2. Implement better selectivity estimates for FTS.
OK, after reading through some of the code, the idea is to write a
custom typanalyze function for tsvector columns. It could look inside
the tsvectors, compute the most commonly appearing lexemes and store
that information in pg_statistic. Then there would be a custom
selectivity function for @@ and friends that would look at the lexemes
in pg_statistic, see whether the tsquery it got matches some/any of them
and return an estimate based on that.
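To make the analyze-time side concrete, here is a minimal standalone
sketch (not backend code; the sample data and names like LexemeCount
are made up for illustration): walk a sample of tsvectors, count how
many rows contain each lexeme, and keep the most common ones with
their frequencies, much like the MCV lists ANALYZE already builds for
scalar types.

/* Simplified, self-contained sketch of the analyze-time pass.
 * In the real backend this would live in a custom typanalyze function
 * and store its results in pg_statistic; here the sample data and all
 * names are hypothetical. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { const char *lexeme; int docs; } LexemeCount;

static int cmp_by_docs(const void *a, const void *b)
{
    return ((const LexemeCount *) b)->docs - ((const LexemeCount *) a)->docs;
}

int main(void)
{
    /* Pretend these are the lexemes of three sampled tsvectors. */
    const char *sample[3][4] = {
        {"postgr", "databas", "index", NULL},
        {"postgr", "search", NULL, NULL},
        {"postgr", "databas", NULL, NULL},
    };
    int nrows = 3;

    LexemeCount counts[32];
    int ncounts = 0;

    /* Count, for each lexeme, how many sampled rows contain it. */
    for (int r = 0; r < nrows; r++)
        for (int c = 0; sample[r][c] != NULL; c++)
        {
            int i;
            for (i = 0; i < ncounts; i++)
                if (strcmp(counts[i].lexeme, sample[r][c]) == 0)
                    break;
            if (i == ncounts)
                counts[ncounts++] = (LexemeCount) {sample[r][c], 0};
            counts[i].docs++;
        }

    /* Keep only the top-N most common lexemes, with their frequencies. */
    qsort(counts, ncounts, sizeof(LexemeCount), cmp_by_docs);
    for (int i = 0; i < ncounts && i < 2; i++)
        printf("%s\t%.2f\n", counts[i].lexeme,
               (double) counts[i].docs / nrows);
    return 0;
}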
I have a feeling that in many cases identifying the top 50 to 300
lexemes would be enough to talk about text search selectivity with a
reasonable degree of confidence. At least we wouldn't give overly low
estimates for queries looking for very popular words, which I believe
is worse than giving an overly high estimate for an obscure query (am I
wrong here?).
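The matching side could then be something like the sketch below
(again standalone and hypothetical, not the actual selectivity API):
look the queried lexeme up in the stored most-common-lexemes list and
return its frequency, falling back to a small default for lexemes that
didn't make the list; ANDed terms could be combined by multiplying
under an independence assumption.

/* Simplified sketch of the query-side estimate; the MCV list, the
 * default for unlisted lexemes, and all names are illustrative
 * assumptions only. */
#include <stdio.h>
#include <string.h>

typedef struct { const char *lexeme; double freq; } McvEntry;

static const McvEntry mcv[] = {
    {"postgr", 0.80}, {"databas", 0.45}, {"index", 0.10},
};
#define NMCV (sizeof(mcv) / sizeof(mcv[0]))
#define DEFAULT_FREQ 0.005   /* guess for lexemes not in the MCV list */

/* Estimated fraction of rows whose tsvector contains the lexeme. */
static double lexeme_selectivity(const char *lexeme)
{
    for (size_t i = 0; i < NMCV; i++)
        if (strcmp(mcv[i].lexeme, lexeme) == 0)
            return mcv[i].freq;
    return DEFAULT_FREQ;
}

int main(void)
{
    /* 'postgr & databas': multiply under an independence assumption. */
    double sel = lexeme_selectivity("postgr") * lexeme_selectivity("databas");
    printf("estimated selectivity: %.4f\n", sel);

    /* A very popular word no longer gets a tiny default estimate. */
    printf("popular word alone: %.4f\n", lexeme_selectivity("postgr"));
    return 0;
}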
Regards,
Jan
--
Jan Urbanski
GPG key ID: E583D7D2
ouden estin