Re: tsvector pg_stats seems quite a bit off.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jan Urbański <wulczer(at)wulczer(dot)org>
Cc:	Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: tsvector pg_stats seems quite a bit off.
Date:	2010-05-29 15:12:40
Message-ID:	19403.1275145960@sss.pgh.pa.us
Lists:	pgsql-hackers

Jan Urbański <wulczer(at)wulczer(dot)org> writes:
> Hm, I am now thinking that maybe this theory is flawed, because tsvectors
> contain only *unique* words, and Zipf's law is talking about words in
> documents in general. Normally a word like "the" would appear lots of
> times in a document, but (even ignoring the fact that it's a stopword
> and so won't appear at all) in a tsvector it will be present only once.
> This may or may not be a problem, not sure if such "squashing" of
> occurrences as tsvectors do skews the distribution away from Zipfian or not.

Well, it's still going to approach Zipfian distribution over a large
number of documents. In any case we are not really depending on Zipf's
law heavily with this approach. The worst-case result if it's wrong
is that we end up with an MCE list shorter than our original target.
I suggest we could try this and see if we notice that happening a lot.
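
One rough way to look at the "squashing" question empirically is a small simulation. The sketch below is not PostgreSQL code and is not from this thread: it assumes word occurrences are drawn independently from a Zipf-like distribution with exponent 1, and the function names (zipf_weights, document_frequencies, ranked_counts) and all parameter values are made up for illustration. Each simulated document is reduced to its set of unique words, as a tsvector would be, and the script counts how many documents contain each word.

```python
import random
from collections import Counter


def zipf_weights(vocab_size, s=1.0):
    """Unnormalized Zipf weights: the word at rank r gets weight 1 / r**s.

    The exponent s=1.0 is an assumption, not something the thread specifies.
    """
    return [1.0 / (rank ** s) for rank in range(1, vocab_size + 1)]


def document_frequencies(num_docs, words_per_doc, vocab_size, seed=0):
    """Count, for each word id, how many simulated documents contain it.

    Word ids are 0..vocab_size-1, with id 0 being the most probable word.
    Duplicates inside a document are squashed with set(), mimicking the
    way a tsvector keeps each lexeme only once.
    """
    rng = random.Random(seed)
    vocab = list(range(vocab_size))
    weights = zipf_weights(vocab_size)
    doc_freq = Counter()
    for _ in range(num_docs):
        words = rng.choices(vocab, weights=weights, k=words_per_doc)
        doc_freq.update(set(words))  # one count per document, not per occurrence
    return doc_freq


def ranked_counts(doc_freq):
    """Document frequencies sorted from most to least common."""
    return sorted(doc_freq.values(), reverse=True)


if __name__ == "__main__":
    df = document_frequencies(num_docs=2000, words_per_doc=50, vocab_size=1000)
    counts = ranked_counts(df)
    # Print a few (rank, document frequency) pairs for eyeballing the curve.
    for rank in (1, 2, 5, 10, 50, 100, 500):
        if rank <= len(counts):
            print(rank, counts[rank - 1])
```

Plotting rank against document frequency on log-log axes for various values of words_per_doc would give a feel for how far the squashed counts drift from a straight Zipfian line under these assumptions. Real text is not an independent draw from a fixed distribution, so this is only a sanity check, not evidence about actual tsvector columns.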

regards, tom lane

In response to

Re: tsvector pg_stats seems quite a bit off. at 2010-05-29 13:56:57 from Jan Urbański
