Re: Column as arrays.. more efficient than columns?

From: "Ow Mun Heng" <ow(dot)mun(dot)heng(at)wdc(dot)com>
To: "Merlin Moncure" <mmoncure(at)gmail(dot)com>, "Ron Johnson" <ron(dot)l(dot)johnson(at)cox(dot)net>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: Column as arrays.. more efficient than columns?
Date: 2007-09-07 14:08:34
Message-ID: 88C32A5D9FC71B4BB1B911B7B2104969C62105@wdmyexbe03.my.asia.wdc.com
Lists: pgsql-general

On 9/7/07, Ron Johnson <ron(dot)l(dot)johnson(at)cox(dot)net> wrote:

> On 09/06/07 20:53, Merlin Moncure wrote:
> [snip]
> >
> > arrays are interesting and have some useful properties. however, we
> > must first discuss the problems...first and foremost, if you need to
> > read any particular item off the array you must read the entire array
> > from disk, and you must write all items back to disk for writes.
>
> Reads and writes are done at the page level, so I'm not sure this is
> valid.

>> sure it is...since the denormalized record is much larger (especially
>> in array scenarios), the tuple is much larger, meaning the page will
>> fill up much more quickly, meaning more dead pages, more vacuuming,
>> etc. Besides that, the server has to do some work presenting the
>> array as part of the read, which is overhead. I didn't go into a lot
>> of detail but the reasoning is sound. Here is a quick example showing
>> the problem.

[snip]
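
As an aside, if I understand the array case being discussed, it's
something like this sketch (the table and column names here are made up
by me, not from my real schema):

CREATE TABLE meas_array (
    snum     text PRIMARY KEY,
    readings float8[]   -- all measurements packed into one array value
);

-- Reading a single element still fetches (and detoasts) the whole
-- array value for that row:
SELECT readings[42] FROM meas_array WHERE snum = 'ABC123';

-- And updating a single element rewrites the whole row as a new tuple
-- version, which I gather is the dead-tuple / page-fill point:
UPDATE meas_array SET readings[42] = 1.5 WHERE snum = 'ABC123';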

This is for a data warehouse, so once the data has been denormalised it
won't be updated again; I would think Merlin's TPS concern doesn't really
come into play here.

Anyway.. here are some stats on the table I was working on.

Denormalising the table reduced the number of rows quite a bit, and the
time taken to return ~2.8K rows of results is only ~2s on the denormalised
table vs. ~76s on the original table.
AFAICT, this shows better read-back performance.

Comments please, since I may not necessarily know what I'm doing.

BTW, I don't really follow all the talk about dead tuples and pages
filling up more quickly. (Also, the data below is based on denormalising
the table into column form rather than into an array.)
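
Concretely, the change is roughly this shape (column names are
anonymised placeholders; param_a etc. are made up by me):

-- Normalised form: one row per (snum, date_time, ttype, parameter).
CREATE TABLE norm (
    snum      text,
    date_time timestamp,
    ttype     text,
    param     text,
    value     float8
);

-- Denormalised "column form": one row per (snum, date_time, ttype),
-- with each parameter pulled out into its own column.
CREATE TABLE denorm (
    snum      text,
    date_time timestamp,
    ttype     text,
    param_a   float8,
    param_b   float8,
    param_c   float8
    -- ...one column per parameter
);

That collapse of many parameter rows into one wide row is why the row
count drops so sharply below.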

normalised table = 8 million rows
denormalised table = 328K rows
join table sfoo = 3.6 million rows
join table dbar = 1.5 million rows
join table smallfoo = 108 rows
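
For reference, the join shape that the first plan below implies is
roughly this (the join conditions are read off the plan's filters; the
SELECT list is my guess):

SELECT denorm.*
FROM   dbar
JOIN   denorm ON denorm.snum = dbar.snum
JOIN   sfoo   ON sfoo.snum = denorm.snum
            AND sfoo.date_time = denorm.date_time
            AND sfoo.ttype = denorm.ttype
JOIN   fam    ON fam.famid = dbar.famid
WHERE  dbar.code = 'AAA71';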

Nested Loop  (cost=0.00..15022.75 rows=1 width=280) (actual time=0.597..1345.239 rows=2872 loops=1)
  Join Filter: ((dbar.famid)::text = (fam.famid)::text)
  ->  Nested Loop  (cost=0.00..15017.32 rows=1 width=274) (actual time=0.310..247.265 rows=2872 loops=1)
        Join Filter: ((sfoo.date_time = denorm.date_time) AND (sfoo.ttype = denorm.ttype))
        ->  Nested Loop  (cost=0.00..5767.36 rows=71 width=281) (actual time=0.246..85.985 rows=2872 loops=1)
              ->  Index Scan using idx_dbar on dbar  (cost=0.00..1175.61 rows=332 width=28) (actual time=0.154..46.172 rows=482 loops=1)
                    Index Cond: ((code)::text = 'AAA71'::text)
              ->  Index Scan using idx_denorm on denorm  (cost=0.00..13.74 rows=7 width=253) (actual time=0.017..0.055 rows=6 loops=482)
                    Index Cond: ((denorm.snum)::text = (dbar.snum)::text)
        ->  Index Scan using idx_ts_sn on sfoo  (cost=0.00..129.48 rows=46 width=37) (actual time=0.010..0.022 rows=6 loops=2872)
              Index Cond: ((sfoo.snum)::text = (denorm.snum)::text)
  ->  Seq Scan on fam  (cost=0.00..4.08 rows=108 width=18) (actual time=0.004..0.169 rows=108 loops=2872)
Total runtime: 1350.234 ms

The above returned 2.8K rows. For comparison, the same query against the
normalised table:

HashAggregate  (cost=61819.46..61819.67 rows=1 width=73) (actual time=76251.012..76586.406 rows=2872 loops=1)
  ->  Nested Loop  (cost=20.55..61819.40 rows=1 width=73) (actual time=140.007..36979.539 rows=57440 loops=1)
        Join Filter: ((dbar.famid)::text = (fam.famid)::text)
        ->  Nested Loop  (cost=20.55..61813.97 rows=1 width=67) (actual time=139.585..3412.300 rows=57440 loops=1)
              Join Filter: (sfoo.ttype = norm.ttype)
              ->  Nested Loop  (cost=0.00..42351.18 rows=792 width=65) (actual time=0.117..464.893 rows=2275 loops=1)
                    ->  Index Scan using idx_dbar on dbar  (cost=0.00..1175.61 rows=332 width=28) (actual time=0.058..7.275 rows=482 loops=1)
                          Index Cond: ((code)::text = 'AAA71'::text)
                    ->  Index Scan using idx_sfoo on sfoo  (cost=0.00..123.45 rows=46 width=37) (actual time=0.761..0.929 rows=5 loops=482)
                          Index Cond: ((sfoo.snum)::text = (dbar.snum)::text)
              ->  Bitmap Heap Scan on norm  (cost=20.55..24.56 rows=1 width=46) (actual time=1.144..1.202 rows=25 loops=2275)
                    Recheck Cond: (((norm.snum)::text = (dbar.snum)::text) AND (sfoo.date_time = norm.date_time))
                    ->  BitmapAnd  (cost=20.55..20.55 rows=1 width=0) (actual time=0.929..0.929 rows=0 loops=2275)
                          ->  Bitmap Index Scan on idx_norm  (cost=0.00..6.70 rows=166 width=0) (actual time=0.056..0.056 rows=142 loops=2275)
                                Index Cond: ((norm.snum)::text = (dbar.snum)::text)
                          ->  Bitmap Index Scan on idx_trz_rundate  (cost=0.00..13.40 rows=604 width=0) (actual time=0.977..0.977 rows=55 loops=2021)
                                Index Cond: (sfoo.date_time = norm.date_time)
        ->  Seq Scan on fam fam_id  (cost=0.00..4.08 rows=108 width=18) (actual time=0.008..0.287 rows=108 loops=57440)
Total runtime: 76591.106 ms
