Re: Using array instead of sub table (storage and speed)

From: Lutz Fischer <l(dot)fischer(at)ed(dot)ac(dot)uk>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Using array instead of sub table (storage and speed)
Date: 2017-06-16 10:37:47
Message-ID: 867b4235-95ba-6910-8224-824f75f7ce99@ed.ac.uk
Lists: pgsql-performance

Hi Stephen,

Thanks for your reply. The data in the sub table [sp] are only ever read
as a block, i.e. I always read all entries in [sp] that belong to one
entry in [s]. So I would not lose much in terms of what I could do with
the data in [sp], and I could save around 2.8K per entry in [s] (just
counting the per-tuple overhead of the [sp] rows).

One thing I still wonder about is how much this would affect the
performance of retrieving data from [s].

I often need some data from [s] where I don't care about [sp]. How much
slower would those queries get if the arrays were part of [s]? Or would
it be better to store the array data in a separate table, e.g. keep [s]
as it is now but turn [sp] into an array-aggregated table?
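For concreteness, that second option could look something like the sketch below. The column names (s_id, charge, intensity) are hypothetical stand-ins, since the real schema isn't shown in this thread, and it also folds in the numeric-to-float change mentioned earlier:

```sql
-- Hypothetical original sub table: many rows per entry in [s].
-- CREATE TABLE sp (s_id int, charge numeric, intensity numeric);

-- Separate array-aggregated table, keyed 1:1 to [s], so [s] itself
-- stays narrow and queries that never touch [sp] stay fast.
CREATE TABLE sp_agg (
    s_id        int PRIMARY KEY,  -- references s(id)
    charges     float4[],         -- one array element per former [sp] row
    intensities float4[]          -- kept in matching array positions
);

-- One-off migration: collapse each group of [sp] rows into one row.
INSERT INTO sp_agg (s_id, charges, intensities)
SELECT s_id,
       array_agg(charge::float4    ORDER BY charge),
       array_agg(intensity::float4 ORDER BY charge)
FROM sp
GROUP BY s_id;
```

With the arrays in a side table, queries touching only [s] never pay for the (likely TOASTed) array data, while the read-everything-as-a-block access pattern costs just one extra lookup per [s] entry.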

Thanks,

Lutz

On 15/06/17 15:37, Stephen Frost wrote:
> Greetings,
>
> * Lutz Fischer (l(dot)fischer(at)ed(dot)ac(dot)uk) wrote:
>> Data in [sp] are never changed. I can probably reduce the size by
>> changing datatypes from numeric to float but I was wondering if it
>> would be more efficient - primarily in terms of storage - to change
>> the structure to have two arrays in [s]. E.g.
> The short answer is 'probably', but it really depends on how wide your
> rows are.
>
>> I haven't really found anything yet how much space (e.g. how many
>> bytes) an array will use compared to a table row in postgresql.
> There's a 24-byte overhead for each tuple. If the width of the tuple's
> columns ends up being less than 24 bytes then half (or more) of the
> space used is for the tuple header. Arrays have a bit of overhead
> themselves but are then densely packed.
>
> In testing that I've done, a table which looks like:
>
> CREATE TABLE t1 (
> c1 int
> );
>
> Will end up with a couple hundred rows per 8k page (perhaps 250 or so),
> meaning that you get ~1k of actual data for 8k of storage. Changing
> this to an array, like so:
>
> CREATE TABLE t1 (
> c1 int[]
> );
>
> And then storing 3-4 tuples per 8k page (keeping each tuple under the 2k
> TOAST limit) led to being able to store something like 450 ints per
> tuple with a subsequent result of 1800 ints per page and ~7.2k worth of
> actual data for 8k of storage, which was much more efficient for
> storage.
>
> Of course, the tuple header is actually useful data in many
> environments- if you go with this approach then you have to work out how
> to deal with the fact that a given tuple is either visible or not, and
> all the ints in the array for that tuple are all visible and that an
> update to that tuple locks the entire tuple and that set of ints, etc.
> If the data isn't changed after being loaded and you're able to load an
> entire tuple all at once then this could work.
>
> Note that arrays aren't more efficient than just using individual
> columns, and this technique is only going to be helpful if the tuple
> overhead in your situation is a large portion of the data and using this
> technique allows you to reduce the number of tuples stored.
>
> Thanks!
>
> Stephen

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
