From: | Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: gsoc, text search selectivity and dllist enhancments |
Date: | 2008-07-11 06:18:25 |
Message-ID: | 4876FB31.8010803@students.mimuw.edu.pl |
Lists: | pgsql-hackers |
Tom Lane wrote:
> Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl> writes:
>> Tom Lane wrote:
> Well, (1) the normal measure would be statistics_target *tsvectors*,
> and we'd have to translate that to lexemes somehow; my proposal is just
> to use a fixed constant instead of tsvector width as in your original
> patch. And (2) storing only statistics_target lexemes would be
> uselessly small and would guarantee that people *have to* set a custom
> target on tsvector columns to get useful results. Obviously broken
> defaults are not my bag.
Fair enough, I'm fine with a multiplication factor.
>> Also, the existing code decides which elements are worth storing as most
>> common ones by discarding those that are not frequent enough (that's
>> where num_mcv can get adjusted downwards). I mimicked that for lexemes
>> but maybe it just doesn't make sense?
>
> Well, that's not unreasonable either, if you can come up with a
> reasonable definition of "not frequent enough"; but that adds another
> variable to the discussion.
The current definition was "with more occurrences than 0.001 of the total
row count, but no less than 2". It was copied straight from
compute_minimal_stats(), and I have no problem with removing it. I think
its point is to guard you against a situation where all elements are more
or less unique, and taking the top N would just give you some random noise.
It doesn't hurt, so I'd be for keeping the mechanism, but if people feel
differently, then I'll just drop it.
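
For illustration, the cutoff described above could be sketched roughly as
follows. This is a minimal Python sketch under stated assumptions, not the
code from compute_minimal_stats() or from the patch: the function name
most_common_lexemes and its parameters are hypothetical, and it reads the
rule as "strictly more than 0.001 of the row count, and at least 2
occurrences".

```python
from collections import Counter


def most_common_lexemes(lexeme_counts, total_rows, num_mcv):
    """Return up to num_mcv (lexeme, count) pairs that pass the cutoff.

    lexeme_counts maps each lexeme to its number of occurrences
    (hypothetical input shape, chosen for this sketch only).
    """
    # Assumed reading of the rule: more than 0.001 of the total row
    # count, but never fewer than 2 occurrences.
    fraction_cutoff = 0.001 * total_rows

    # Take the top num_mcv candidates, most frequent first.
    ranked = Counter(lexeme_counts).most_common(num_mcv)

    # Drop candidates that are too rare; this is where the effective
    # number of stored entries can shrink below num_mcv.
    return [(lex, n) for lex, n in ranked if n > fraction_cutoff and n >= 2]
```

With counts such as {"cat": 50, "dog": 3, "emu": 1} over 1000 rows, "emu"
is dropped because a single occurrence does not pass the cutoff. If every
lexeme occurs exactly once, nothing is kept, which is the "random noise"
case the mechanism guards against.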
--
Jan Urbanski
GPG key ID: E583D7D2
ouden estin