From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Jan Urbański <wulczer(at)wulczer(dot)org> |
Cc: | Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: tsvector pg_stats seems quite a bit off. |
Date: | 2010-05-29 15:09:13 |
Message-ID: | 19353.1275145753@sss.pgh.pa.us |
Lists: | pgsql-hackers |
Jan Urbański <wulczer(at)wulczer(dot)org> writes:
> Now I tried to substitute some numbers there, and so assuming the
> English language has ~1e6 words H(W) is around 6.5. Let's assume the
> statistics target to be 100.
> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
> probably be stopwords, so we will never see them in the input.
> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
There is definitely something wrong with your math there. It's not
possible for the 100'th most common word to have a frequency as high
as 0.06 --- the ones above it presumably have larger frequencies,
which makes the total quite a lot more than 1.0.
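Spelled out as a quick sanity check (a minimal sketch, using only the numbers quoted above):

```python
# If the 100th most common word really had frequency 0.06, each of the 100
# words at or above it would have frequency >= 0.06, so together they would
# account for at least:
lower_bound = 100 * 0.06
print(lower_bound)   # 6.0 -- far more than the 1.0 that all frequencies can sum to
```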
For the purposes here, I think it's probably unnecessary to use the more
complex statements of Zipf's law. The interesting property is the rule
"the k'th most common element occurs 1/k as often as the most common one".
So if you suppose the most common lexeme has frequency 0.1, the 100'th
most common should have frequency around 0.001. That's pretty crude
of course but it seems like the right ballpark.
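A minimal sketch of that rule of thumb (the 0.1 frequency for the most common lexeme is just the assumed figure above, not a measured value):

```python
# The k'th most common element occurs 1/k as often as the most common one.
def zipf_estimate(k, f_top=0.1):
    """Estimated frequency of the k'th most common lexeme."""
    return f_top / k

print(zipf_estimate(1))    # 0.1
print(zipf_estimate(100))  # 0.001 -- the right ballpark for the 100th lexeme
```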
regards, tom lane