Re: patch submission: truncate trailing nulls from heap rows to reduce the size of the null bitmap

From: Jameison Martin <jameisonb(at)yahoo(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: patch submission: truncate trailing nulls from heap rows to reduce the size of the null bitmap
Date: 2012-04-26 18:59:01
Message-ID: 1335466741.63867.YahooMailNeo@web39401.mail.mud.yahoo.com
Lists: pgsql-hackers

Simon and Greg,

The math on space savings assumes that columns are populated roughly in the order they are declared in the DDL, not that column values are randomly distributed across the row. That is the case for the particular schema I'm looking at. I'm not asserting that it is the common case in general, though it may be more common than not, given that several commercial databases optimize for trailing null column values and developers often design their schemas with this in mind.
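
To put rough numbers on that (using the heap layout as it stands: a 23-byte tuple header, a null bitmap of one bit per attribute rounded up to whole bytes, and the header as a whole padded to the usual 8-byte MAXALIGN): with 800 columns the bitmap is ceil(800/8) = 100 bytes, so the header occupies MAXALIGN(23 + 100) = 128 bytes per row. If a row populates only its first 50 columns and the 750 trailing nulls are truncated away, the bitmap shrinks to ceil(50/8) = 7 bytes and the header to MAXALIGN(23 + 7) = 32 bytes, a saving of 96 bytes per row before any data is even stored.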

If there is an exact standard for how this group does performance analysis (e.g. removing outliers beyond a certain standard deviation, number of repetitions, machine isolation requirements, and so forth), please let me know. I can submit my results as-is, but in the interest of avoiding a lot of duplicate postings, perhaps someone can point me to an example of the kind of numbers that are desired so I can make sure my posting conforms to it. For what it's worth, I ran the three tests 10 times each and removed the outliers, but I can run them 100 times or do something different if need be (e.g. post a CSV for easy consumption in a spreadsheet). Also, Simon, you mentioned posting "environment notes"; can you let me know what kind of environment notes are desired? For example, are you thinking of changes to the vanilla postgresql.conf, hardware information, OS config, etc.?

Greg, all I'm trying to establish is that this change doesn't hurt insert performance in the common case, as per Tom's comments. I'll try to add some additional test cases with varying numbers of trailing null column values to see if we can establish the potential salutary effect with a bit more data, but I'm not actually asserting that the effect is significant or that it justifies the patch. It would be interesting to see what the performance benefit is for real queries against rows that have much smaller bitmaps, but I'd prefer not to get into that.
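
For concreteness, the additional timing cases I have in mind look something like the following. This is just a sketch; the table names are placeholders, and the shape mirrors the 5-integer-column, 10-million-row setup from my earlier tests:

-- Source table: 5 integer columns, 10 million rows, fully populated.
CREATE TABLE src (a int, b int, c int, d int, e int);
INSERT INTO src SELECT i, i, i, i, i FROM generate_series(1, 10000000) i;

CREATE TEMP TABLE tgt (a int, b int, c int, d int, e int);

\timing on
-- Variation 1: all target columns set (nothing to truncate).
INSERT INTO tgt SELECT a, b, c, d, e FROM src;
TRUNCATE tgt;
-- Variation 2: only the first column set (4 trailing nulls per row,
-- so heap_fill_tuple() has fewer attributes to walk with the patch).
INSERT INTO tgt SELECT a, NULL, NULL, NULL, NULL FROM src;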

As for proof of the size reduction, I'd actually like to codify something in a regression test to guard against future regressions in the patch's behavior. I was a little leery of creating a regression test that depends on internals that might cause it to break over time, so I punted on it. Does anyone have a good suggestion for a safe way to verify, in a test that is unlikely to break over time, that the proposed behavioral change is working as intended? The best thing I could come up with was to create a very wide table, insert some sparse rows (trailing nulls), and verify that the resulting page count is lower than the untruncated row format would require. In any event, I'll post a comparative relation-size number and test as well.
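
For example, something along these lines (the column and row counts are arbitrary), comparing page counts rather than raw bytes so the check is less sensitive to incidental layout details:

-- Build an 800-column table without spelling out 800 names.
DO $$
BEGIN
  EXECUTE 'CREATE TABLE wide (' ||
          (SELECT string_agg('c' || i || ' int', ', ' ORDER BY i)
             FROM generate_series(1, 800) i) || ')';
END $$;

-- Sparse rows: only c1 is set, leaving 799 trailing nulls per row.
INSERT INTO wide (c1) SELECT g FROM generate_series(1, 100000) g;

-- With truncation, a row with only c1 set needs no null bitmap at all
-- (the header MAXALIGNs to 24 bytes instead of 128), so the relation
-- should occupy far fewer pages than a full-width bitmap would require.
SELECT pg_relation_size('wide') / current_setting('block_size')::int AS pages;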

Cheers.

-Jamie

________________________________
From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Jameison Martin <jameisonb(at)yahoo(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>; "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Sent: Thursday, April 26, 2012 12:27 AM
Subject: Re: [HACKERS] patch submission: truncate trailing nulls from heap rows to reduce the size of the null bitmap

On Thu, Apr 26, 2012 at 1:35 AM, Jameison Martin <jameisonb(at)yahoo(dot)com> wrote:
> Tom, I whipped up some INSERT/SELECT tests where I selected into a
> temporary table as you suggested. The target temporary table and the source
> table were in cache and I basically disabled things that would cause noise.
> The source table had 5 integer columns, and was populated with 10 million
> rows.
>
> I tried 3 variations:
>   1) target has all nullable columns, all set to non-null values: the
> results were the same
>   2) target has all nullable columns, only the first column is set: the
> patch was slightly faster
>   3) target has all non-null columns: the patch was maybe slightly faster,
> probably not statistically significant
>
> By slightly faster I'm talking on the order of 10 nanoseconds per row.
>
> I think #2 is explained by the reduction in loop iterations in
> heap_fill_tuple().

I see this as a useful use case, one I have come across a few
times, most typically in very large databases.

It will be a win in those cases, but I think your maths is unrealistic
for the common case. In your case, you're saying that you have 750
trailing null columns that will be all-NULL in 90% of cases. Given a
randomly distributed set of col values, I'd expect the last non-NULL
to fall on average around the 400th column, perhaps later. So the
savings are still high, but not as high in the general case as they
are for you.

The performance tests Tom asks for are essential; otherwise we cannot
proceed. Thanks for starting those.

Please post your test code, any environment notes and your exact test
results. The important point is that we need objectively confirmable
tests, not just your word that it was faster. Everybody is held to the
same level of proof here, so it's not a personal doubt.

It would be useful to post sizes of databases also, to confirm that
the patch really does reduce database size.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
