| From: | Noah Misch <noah(at)leadboat(dot)com> | 
|---|---|
| To: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> | 
| Cc: | Nathan Boley <npboley(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: Collect frequency statistics for arrays | 
| Date: | 2012-01-06 20:16:35 | 
| Message-ID: | 20120106201635.GA3520@tornado.leadboat.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
Corrections:
On Thu, Dec 29, 2011 at 11:35:00AM -0500, Noah Misch wrote:
> On Wed, Nov 09, 2011 at 08:49:35PM +0400, Alexander Korotkov wrote:
> > +  *	We set s to be the estimated frequency of the K'th element in a natural
> > +  *	language's frequency table, where K is the target number of entries in
> > +  *	the MCELEM array. We assume that the distribution of element frequencies
> > +  *	follows Zipf's law with an exponent of 1.
> > +  *
> > +  *	Assuming Zipfian distribution, the frequency of the K'th element is equal
> > +  *	to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of
> > +  *	elements in the language.	Putting W as one million, we get roughly
> > +  *	0.07/K. This gives s = 0.07/K.	We set epsilon = s/10, which gives bucket
> > +  *	width w = K/0.007 and maximum expected hashtable size of about 1000 * K.
> 
> These last two paragraphs, adapted from ts_typanalyze.c, assume natural
> language documents.  To what extent do these parameter choices remain sensible
> for arbitrary data such as users may place in arrays?  In any event, we need a
> different justification, even if it's just a hand-wavy justification.
> 
> If I'm following this correctly, this choice of "s" makes the algorithm
> guaranteed to find only elements constituting >= 7% of the input elements.
> Incidentally, isn't that far too high for natural language documents?  If the
> English word "the" only constitutes 7% of typical documents, then this "s"
> value would permit us to discard practically every word; we'd be left with
> words read while filling the last bucket and words that happened to repeat
> exceedingly often in the column.  I haven't tried to make a test case to
> observe this problem; am I missing something?  (This question is largely
> orthogonal to your patch.)
No, we'll find elements of frequency at least 0.07/(default_statistics_target
* 10) -- in the default configuration, 0.007%.  Also, ts_typanalyze() counts
the number of documents that contain one or more instances of each lexeme,
ignoring the number of appearances within each document.  The word "the" may
constitute 7% of a typical document, but it will appear at least once in
nearly 100% of documents.  Therefore, this "s" value is adequate even for the
pathological case of each "document" containing just one lexeme.
> > +  *
> > +  *	Note: in the above discussion, s, epsilon, and f/N are in terms of a
> > +  *	element's frequency as a fraction of all elements seen in the input.
> > +  *	However, what we actually want to store in the finished pg_statistic
> > +  *	entry is each element's frequency as a fraction of all rows that it occurs
> > +  *	in. Elements might be repeated in the same array. Since operators
> > +  *	<@, &&, @> takes care only about element occurence itself and not about
> > +  *	occurence count, function takes additional care about uniqueness of
> > +  *	counting. Also we need to change the divisor from N to nonnull_cnt to get
> > +  *	the number we want.
> 
> On the same tangent, why does ts_typanalyze() not deduplicate the same way?
> The @@ operator has the same property.
Answer: to_tsvector() will have already done so.
> > + 	/*
> > + 	 * We set bucket width equal to (num_mcelem + 10) / 0.007 as per the
> > + 	 * comment above.
> > + 	 */
> > + 	bucket_width = num_mcelem * 1000 / 7;
> 
> The addend mentioned is not present in the code or discussed in "the comment
> above".  (I see the comment is copied verbatim from ts_typanalyze(), where the
> addend *is* present, though again the preceding comment says nothing of it.)
The addend rationale in fact does appear in the ts_typanalyze() comment.
Thanks,
nm
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Simon Riggs | 2012-01-06 20:47:12 | Re: 16-bit page checksums for 9.2 | 
| Previous Message | Andres Freund | 2012-01-06 20:03:49 | Re: 16-bit page checksums for 9.2 |