Quick Links

Re: tsvector pg_stats seems quite a bit off.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jan Urbański <wulczer(at)wulczer(dot)org>
Cc:	Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: tsvector pg_stats seems quite a bit off.
Date:	2010-05-28 20:22:25
Message-ID:	7523.1275078145@sss.pgh.pa.us
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= <wulczer(at)wulczer(dot)org> writes:
> We follow the algorithm as written, the trouble starts when we want to
> output the result. The paper says which items from the D structure
> should be returned when the user asks for items that have frequencies
> higher than a threshold s. What we want to put in the statistics table
> are items accompanied by their frequencies, so we need to do some
> reasoning before we can construct the result.

Well, the estimated frequency is still just f/N. The point is that we
must filter out items with small f values because they're probably
inaccurate --- in particular, anything with f < eN is completely
untrustworthy.

I agree that we currently aren't bothering to determine a specific s
value, but we probably need to do that in order to have a clear
understanding of what we are getting.

The idea that I was toying with is to assume a Zipfian distribution of
the input (with some reasonable parameter), and use that to estimate
what the frequency of the K'th element will be, where K is the target
number of MCV entries or perhaps a bit more. Then use that estimate as
the "s" value, and set e = s/10 or so, and then w = 1/e and continue as
per the paper. If the eventual filtering results in a lot less than the
target number of MCV entries (because the input wasn't so Zipfian), we
lose, but at least we have accurate numbers for the entries we kept.

> I we should definitely prune the table one last time in the very
> probable case that the loop ended in the middle of two regularly
> happening prunes.

The paper doesn't say that you need to do that. I suspect if you work
through the math, you'll find out that the minimum-f filter skips
anything that would have been pruned anyway.

regards, tom lane

In response to

Re: tsvector pg_stats seems quite a bit off. at 2010-05-28 20:04:07 from Jan Urbański

Responses

Re: tsvector pg_stats seems quite a bit off. at 2010-05-28 21:47:28 from Jan Urbański

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Pavel Stehule	2010-05-28 20:23:46	Re: How to pass around collation information
Previous Message	Robert Haas	2010-05-28 20:15:27	Re: How to pass around collation information