Re: Reducing data type space usage

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reducing data type space usage
Date: 2006-09-16 21:48:03
Message-ID: 200609162148.k8GLm3x06774@momjian.us
Lists: pgsql-hackers

Gregory Stark wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
>
> > Tom Lane wrote:
> >> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> >> > The user would have to decide that he'll never need a value over 127 bytes
> >> > long ever in order to get the benefit.
> >>
> >> Weren't you the one that's been going on at great length about how
> >> wastefully we store CHAR(1) ? Sure, this has a somewhat restricted
> >> use case, but it's about as efficient as we could possibly get within
> >> that use case.
>
> Sure, this helps with CHAR(1) but there were plen

OK. One thing that we have to remember is that the goal isn't to
squeeze every byte out of the storage format. That would be
inefficient, performance-wise. We just need a reasonable storage layout.

> > To summarize what we are now considering:
> >
> > Originally, there was the idea of doing 1,2, and 4-byte headers. The
> > 2-byte case is probably not worth the extra complexity (saving 2 bytes
> > on a 128-byte length isn't very useful).
>
> Well don't forget we virtually *never* use more than 2 bytes out of the 4 byte
> headers for on-disk data. The only way we ever store a datum larger than 16k
> is you compile with 32k blocks *and* you explicitly disable toasting on the
> column.

Well, if we went with 2-byte, then we are saying we are not going to
store the TOAST length in the heap header, but store it somewhere else,
probably in TOAST. I can see how that could be done. This would leave
us with 0, 1, and 2-byte headers, and 4-byte headers in TOAST. Is that
something to consider? I think one complexity is that we are going to
need 4-byte headers in the backend to move around values, so there is
going to need to be a 2-byte to 4-byte mapping for all data types, not
just the short ones. If this only applies to TEXT, bytea, and a few
other types, it is uncertain whether it is worth it.
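
To make the 2-byte to 4-byte mapping concrete, here is a minimal sketch in
Python. It is only an illustration, not backend code: the function names
expand_header and shrink_header, the little-endian byte order, and the
convention that the length counts payload bytes only are all assumptions
made for the example.

```python
import struct

# Hypothetical layout, for illustration only: a 2-byte on-disk length
# header that is widened to a 4-byte header when the value is loaded.
SHORT_HDR = struct.Struct("<H")  # 2-byte unsigned length (byte order assumed)
LONG_HDR = struct.Struct("<I")   # 4-byte unsigned length (byte order assumed)


def expand_header(disk_datum: bytes) -> bytes:
    """Turn a 2-byte-header datum into a 4-byte-header datum."""
    (payload_len,) = SHORT_HDR.unpack_from(disk_datum, 0)
    payload = disk_datum[SHORT_HDR.size:SHORT_HDR.size + payload_len]
    if len(payload) != payload_len:
        raise ValueError("datum is shorter than its header claims")
    return LONG_HDR.pack(payload_len) + payload


def shrink_header(mem_datum: bytes) -> bytes:
    """Turn a 4-byte-header datum back into a 2-byte-header datum."""
    (payload_len,) = LONG_HDR.unpack_from(mem_datum, 0)
    if payload_len > 0xFFFF:
        raise ValueError("value too long for a 2-byte header")
    payload = mem_datum[LONG_HDR.size:LONG_HDR.size + payload_len]
    if len(payload) != payload_len:
        raise ValueError("datum is shorter than its header claims")
    return SHORT_HDR.pack(payload_len) + payload
```

Every value read from disk would pass through something like expand_header,
and every value written would pass through something like shrink_header,
which is the extra step for all varlena types, not just the short ones.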

(We do store the TOAST length in heap, right, just before the TOAST
pointer?)

> Worse, if we don't do anything about fields like text it's not true that this
> only occurs on 128-byte columns and larger. It occurs on any column that
> *could* contain 128 bytes or more. Ie, any column declared as varchar(128)
> even if it contains only "Bruce" or any column declared as text or numeric.

Well, if you are using TEXT, it is hard to say you are worried about
storage size. I can't imagine many one-byte values are stored in TEXT.
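
For reference, the rule Greg describes can be sketched as below. This is a
simplified model, not PostgreSQL code: header_len_for_column and
SHORT_HEADER_MAX are hypothetical names, the limit is taken in bytes, and
None stands for a type with no declared limit such as text.

```python
from typing import Optional

# Assumed cutoff: the longest value a 1-byte header is taken to cover.
SHORT_HEADER_MAX = 127


def header_len_for_column(declared_max_bytes: Optional[int]) -> int:
    """Pick the header size from the column declaration, not from the data.

    A column that *could* hold more than SHORT_HEADER_MAX bytes gets the
    4-byte header even if every stored value is tiny.
    """
    if declared_max_bytes is not None and declared_max_bytes <= SHORT_HEADER_MAX:
        return 1
    return 4
```

Under this model a column limited to 127 bytes gets the 1-byte header, while
a 128-byte limit or an unlimited column gets the 4-byte header no matter what
is stored in it.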

> I'm not sure myself whether the smallfoo data types are a bad idea in
> themselves though. I just think it probably doesn't replace trying to shorten
> the largefoo varlena headers as well.

See above. Using just 2-byte headers in heap is a possibility. I am
just not sure if the overhead is worth it. With the 0-1 header, we
don't have any backend changes as data is passed around from the disk to
memory. Doing the 2-byte header would require that.

> Part of the reason I think the smallfoo data types may be a bright idea in
> their own right is that the datatypes might be able to do clever things about
> their internal storage. For instance, smallnumeric could use base 100 where
> largenumeric uses base 10000.

I hardly think modifying the numeric routines to handle two different
bases is worth it.
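
A rough way to see how much is at stake, as a Python sketch. It is a
simplified model and not the numeric code: it handles non-negative integers
only, ignores the sign, weight, and scale fields, and assumes one byte per
base-100 digit and two bytes per base-10000 digit. The function names are
made up for the example.

```python
def digit_bytes(value: int, base: int, bytes_per_digit: int) -> int:
    """Bytes needed for the digit array of a non-negative integer."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    ndigits = 0
    while value > 0:
        value //= base
        ndigits += 1
    return ndigits * bytes_per_digit


def base100_bytes(value: int) -> int:
    # Assumption: a base-100 digit fits in one byte.
    return digit_bytes(value, 100, 1)


def base10000_bytes(value: int) -> int:
    # Assumption: a base-10000 digit takes two bytes.
    return digit_bytes(value, 10000, 2)
```

In this model 7 takes 1 byte in base 100 and 2 bytes in base 10000, 1234
takes 2 bytes either way, and 12345 takes 3 bytes versus 4. The difference
is never more than one byte per value.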

> > I am slightly worried about having short version of many of our types.
> > Not only char, varchar, and text, but also numeric. I see these varlena
> > types in the system:
>
> I think only the following ones make sense for smallfoo types:
>
> > bpchar
> > varchar
> > bit
> > varbit
> > numeric

OK, bit and numeric are ones we didn't talk about yet.

> These don't currently take typmods so we'll never know when they could use a
> smallfoo representation, it might be useful if they did though:
>
> > bytea
> > text
> > path
> > polygon

Good point.

>
>
> Why are these varlena? Just for ipv6 addresses? Is the network mask length not
> stored if it's not present? This gives us a strange corner case in that ipv4
> addresses will *always* fit in the smallfoo data type and ipv6 *never* fit.
> Ie, we'll essentially end up with an ipv4inet and an ipv6inet. Sad in a way.
>
> > inet
> > cidr

Yes, I think so.

>
> I have to read up on what this is.
>
> > refcursor
>
>
> > Are these shorter headers going to have the same alignment requirements
> > as the 4-byte headers? I am thinking not, meaning we will not have as
> > much padding overhead as we have now.
>
> Well a 1-byte length header doesn't need any alignment so they would have only
> the alignment that the data type itself declares. I'm not sure how that interacts
> with heap_deform_tuple but it's probably simpler than finding out only once
> you parse the length header what alignment you need.

That is as big a win as the shorter header. Doing a variable length
header with big-endian encoding and stuff would be a mess, for sure.
With 0-1 header, your alignment doesn't need to change from the disk to
memory.
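
The padding effect can be sketched in a few lines of Python. This is an
illustration under stated assumptions, not the tuple-forming code: align_up
and field_end are hypothetical helpers, the 4-byte header is assumed to need
4-byte alignment, the 1-byte header is assumed to need none, and the payload
itself has no alignment requirement.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def field_end(offset: int, header_len: int, header_align: int,
              payload_len: int) -> int:
    """Offset just past a variable-length field that starts at or after offset."""
    start = align_up(offset, header_align)  # padding is inserted here
    return start + header_len + payload_len


# A 5-byte value stored right after a 1-byte column (so it begins at offset 1).
end_with_4_byte_header = field_end(1, header_len=4, header_align=4, payload_len=5)
end_with_1_byte_header = field_end(1, header_len=1, header_align=1, payload_len=5)
```

In this model the field ends at offset 13 with the 4-byte header and at
offset 7 with the 1-byte header. Of the 6 bytes saved, 3 come from the
shorter header and 3 from the padding that is no longer needed.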

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
