Re: Should heapam_estimate_rel_size consider fillfactor?

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Should heapam_estimate_rel_size consider fillfactor?
Date: 2023-07-03 09:40:58
Message-ID: 60eaa69b-1c57-67ed-3aee-7600db134e42@enterprisedb.com
Lists: pgsql-hackers

On 7/3/23 09:00, Tomas Vondra wrote:
>
>
> On 7/3/23 08:46, Peter Eisentraut wrote:
>> On 14.06.23 20:41, Corey Huinker wrote:
>>>     So maybe we should make table_block_relation_estimate_size smarter to
>>>     also consider the fillfactor in the "no statistics" branch, per the
>>>     attached patch.
>>>
>>>
>>> I like this a lot. The reasoning is obvious, the fix is simple, it
>>> doesn't upset any make-check-world tests, and in order to get a
>>> performance regression we'd need a table whose fillfactor has been
>>> changed after the data was loaded but before an analyze happens, and
>>> that's a narrow enough case to accept.
>>>
>>> My only nitpick is to swap
>>>
>>>     (usable_bytes_per_page * fillfactor / 100) / tuple_width
>>>
>>> with
>>>
>>>     (usable_bytes_per_page * fillfactor) / (tuple_width * 100)
>>>
>>>
>>> as this will eliminate the extra remainder truncation, and it also
>>> gets the arguments "in order" algebraically.
>>
>> The fillfactor is in percent, so it makes sense to divide it by 100
>> first before doing any further arithmetic with it.  I find your version
>> of this to be more puzzling without any explanation.  You could do
>> fillfactor/100.0 to avoid the integer division, but then, the comment
>> above that says "integer division is intentional here", without any
>> further explanation.  I think a bit more explanation of all the
>> subtleties here would be good in any case.
>>
>
> Yeah, I guess the formula should be doing (fillfactor / 100.0).
>
> The "integer division is intentional here" comment is unrelated to this
> change; it refers to the division by "tuple_width" (and yeah, it'd be
> nice to have it explain why it's intentional).
>

FWIW the reason why the integer division is intentional is most likely
that we want "floor" semantics - if there's 10.23 rows per page, that
really means 10 rows per page.
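
For illustration (made-up numbers, not the actual code):

    /* integer division truncates, i.e. floor semantics for positives */
    int usable_bytes_per_page = 8168;   /* hypothetical */
    int tuple_width = 798;              /* hypothetical */
    int density = usable_bytes_per_page / tuple_width; /* 10.23... -> 10 */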

I doubt it makes a huge difference in this particular place, considering
we're calculating the estimate from somewhat unreliable values and then
using it for a rough estimate of the relation size.

But from this POV, I think it's more correct to do it "my" way:

density = (usable_bytes_per_page * fillfactor / 100) / tuple_width;

because that's doing *two* separate integer divisions, with floor
semantics. First we calculate "usable bytes" (rounded down), then the
average number of rows per page (also rounded down).

Corey's formula would do just one integer division. I don't think it
makes a huge difference, though - it's just an estimate, so we can't
expect it to be 100% accurate anyway.
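
To make that concrete, here's a sketch with made-up numbers (this is
not the actual code, just an illustration):

    int usable_bytes_per_page = 8168;  /* hypothetical: 8192 - page header */
    int fillfactor = 70;               /* percent */
    int tuple_width = 57;              /* hypothetical */

    /* two integer divisions, flooring at each step */
    int a = (usable_bytes_per_page * fillfactor / 100) / tuple_width;
    /* 8168 * 70 = 571760, 571760 / 100 = 5717, 5717 / 57 = 100 */

    /* Corey's variant: a single integer division at the end */
    int b = (usable_bytes_per_page * fillfactor) / (tuple_width * 100);
    /* 571760 / 5700 = 100 */

For non-negative integers the two should come out the same anyway
(flooring by 100 and then by tuple_width gives the same result as
flooring once by their product), so it's mostly a readability question.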

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
