Re: tsvector pg_stats seems quite a bit off.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jan Urbański <wulczer(at)wulczer(dot)org>
Cc:	Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: tsvector pg_stats seems quite a bit off.
Date:	2010-05-30 14:46:44
Message-ID:	5735.1275230804@sss.pgh.pa.us
Lists:	pgsql-hackers

Jan Urbański <wulczer(at)wulczer(dot)org> writes:
>> I think the only relevance of stopwords to the current problem is that
>> *if* stopwords have been removed, we would see a Zipfian distribution
>> with the first few entries removed, and I'm not sure if it's still
>> really Zipfian afterwards.

> That's why I was proposing to take s = 0.07 / (MCE-count + 10). But that probably doesn't matter much.

Oh, now I get the point of that. Yeah, it is probably a good idea.
If the input doesn't have stopwords removed, the worst that will happen
is we'll collect stats for an extra 10 or so lexemes, which will then
get thrown away when they don't fit into the MCE list. +1.

regards, tom lane
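
The threshold arithmetic discussed above can be sketched as follows. This is a rough illustration only, not the actual PostgreSQL code. It assumes an idealized Zipfian model in which the lexeme of rank r has frequency 0.07 / r, and it reads the "+ 10" as an allowance for roughly ten stopwords that may already have been stripped. The function names and the stopword_allowance parameter are hypothetical.

```python
def zipf_frequency(rank, top_frequency=0.07):
    """Frequency of the rank-th most common lexeme under an idealized
    Zipf law (assumption: frequency = top_frequency / rank)."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return top_frequency / rank


def min_frequency_threshold(mce_count, stopword_allowance=10,
                            top_frequency=0.07):
    """s = 0.07 / (MCE-count + 10), as proposed in the quoted text.

    stopword_allowance is a hypothetical name for the "+ 10" term.
    """
    if mce_count < 1:
        raise ValueError("mce_count must be >= 1")
    return zipf_frequency(mce_count + stopword_allowance, top_frequency)


def lexemes_at_or_above(threshold, top_frequency=0.07):
    """Count ranks whose idealized Zipf frequency is >= threshold,
    i.e. how many lexemes would be tracked if no stopwords were removed."""
    rank = 1
    while zipf_frequency(rank, top_frequency) >= threshold:
        rank += 1
    return rank - 1


if __name__ == "__main__":
    for mce_count in (10, 100, 1000):
        s = min_frequency_threshold(mce_count)
        tracked = lexemes_at_or_above(s)
        print(mce_count, s, tracked - mce_count)
```

Under this model, input that still contains its stopwords yields about mce_count + 10 lexemes at or above s, so the surplus of about 10 is what gets discarded when the list is trimmed to the MCE count.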

In response to

Re: tsvector pg_stats seems quite a bit off. at 2010-05-30 14:41:40 from Jan Urbański