From: | John Naylor <jcnaylor(at)gmail(dot)com> |
---|---|
To: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: MCV lists for highly skewed distributions |
Date: | 2017-12-29 14:05:28 |
Message-ID: | CAJVSVGWm9QgkbpgbZ74GCLD=UdSrYo4aq+tBdAU9foR-N+1NNw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 12/28/17, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> I want to revive a patch I sent couple years ago to the performance list,
> as I have seen the issue pop up repeatedly since then.
> If we stored just a few more values, their inclusion in the MCV would mean
> they are depleted from the residual count, correctly lowering the estimate
> we would get for very rare values not included in the sample.
>
> So instead of having the threshold of 1.25x the average frequency over all
> values, I think we should use 1.25x the average frequency of only those
> values not already included in the MCV, as in the attached.
After reading the thread you mentioned, I think that's a good approach.
On master, the numerator of avg_count is nonnull_cnt, but you've
(indirectly) set it to values_cnt. You didn't mention it here, but I
think you're right and master is wrong, since in this function only
sortable values go through the MCV path. If I understand the code
correctly, if there are enough too-wide values in the sample, they
could bias the average and prevent normal values from being considered
for the MCV list. Am I off base here?
Anyway, since you're now overwriting ndistinct_table, I would rename
it to ndistinct_remaining for clarity's sake.
This part doesn't use any loop variables, so should happen before the loop:
+ if (num_mcv > track_cnt)
+ num_mcv = track_cnt;
I think this comment
/* estimate # of occurrences in sample of a typical nonnull value */
belongs just above the calculation of avg_count.
> I think that perhaps maxmincount should also use the dynamic
> values_cnt_remaining rather than the static one. After all, things
> included in the MCV don' get represented in the histogram. When I've seen
> planning problems due to skewed distributions I also usually see redundant
> values in the histogram boundary list which I think should be in the MCV
> list instead. But I have not changed that here, pending discussion.
I think this is also a good idea, but I haven't thought it through. If
you don't go this route, I would move this section back out of the
loop as well.
-John Naylor
From | Date | Subject | |
---|---|---|---|
Next Message | David Steele | 2017-12-29 14:11:12 | Re: Basebackups reported as idle |
Previous Message | Vladimir Svedov | 2017-12-29 13:57:33 | array_ndims never returns zero |