Re: [HACKERS] compression in LO and other fields

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: wieck(at)debis(dot)com (Jan Wieck)
Cc: zakkr(at)zf(dot)jcu(dot)cz (Karel Zak - Zakkr), t-ishii(at)sra(dot)co(dot)jp, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] compression in LO and other fields
Date: 1999-11-12 14:44:03
Message-ID: 26512.942417843@sss.pgh.pa.us
Lists: pgsql-hackers

wieck(at)debis(dot)com (Jan Wieck) writes:
> Html input might be somewhat optimal for Adisak's storage
> format, but taking into account that my source implementing
> the type input and output functions is smaller than 600
> lines, I think 11% difference to a gzip -9 is a good result
> anyway.

These strike me as very good results. I'm not at all sure that using
gzip or bzip would give much better results in practice in Postgres,
because those compressors are optimized for relatively large files,
whereas a compressed-field datatype would likely be getting relatively
small field values to work on. (So your test data set is probably a
good one for our purposes --- do the numbers change if you exclude
all the files over, say, 10K?)
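
A rough sketch of that effect (mine, hedged: zlib stands in for gzip's
deflate, and the test string is an arbitrary ~50-byte value) shows the
fixed header/trailer overhead dominating on short inputs; build with -lz:

/*
 * Illustration only, not part of the patch under discussion: why a
 * general-purpose deflate gains little on a short field value.
 */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int
main(void)
{
    const char *field = "SELECT relname FROM pg_class WHERE relkind = 'r';";
    Bytef       out[256];
    uLongf      outlen = sizeof(out);

    if (compress2(out, &outlen, (const Bytef *) field, strlen(field),
                  Z_BEST_COMPRESSION) != Z_OK)
    {
        fprintf(stderr, "compress2 failed\n");
        return 1;
    }

    /* With only ~50 bytes of input there is no history to match against,
     * so the "compressed" output is typically no smaller than the input. */
    printf("in=%lu bytes, out=%lu bytes\n",
           (unsigned long) strlen(field), (unsigned long) outlen);
    return 0;
}

On inputs that small the output usually comes back as large as, or larger
than, the original --- which is exactly the regime a compressed-field
datatype would see most of the time.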

> Bruce suggested the contrib area, but I'm not sure if that's
> the right place. If it goes into the distribution at all, I'd
> like to use this data type for rule plan strings and function
> source text in the system catalogs.

Right, if we are going to bother with it at all, we should put it
into the core so that we can use it for rule plans.

> I don't expect we'll have
> a general solution for tuples split across multiple blocks
> for v7.0.

I haven't given up hope of that yet --- but even if we do, compressing
the data is an attractive choice to reduce the frequency with which
tuples must be split across blocks.

It occurred to me last night that applying compression to individual
fields might not be the best approach. Certainly a "bytez" data type
is the easiest thing to fit into the existing system, but it's leaving
some space savings on the table. What about compressing the *whole*
data contents of a tuple on-disk, as a single entity? That should save
more space than field-by-field compression. It could be triggered in
the tuple storage routines whenever the uncompressed size exceeds some
threshold. (We'd need a flag in the tuple header to indicate compressed
data, but I think there are bits to spare.) When we get around to
having split tuples, the code would still be useful because it'd be
applied as a first resort before splitting a large tuple; it'd reduce
the frequency of splits and the number of sections big tuples get split
into. All automatic and transparent, too --- the user doesn't have to
change data declarations at all.
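
To make the proposal concrete, a hypothetical sketch follows (FakeTuple,
HEAP_COMPRESSED and COMPRESS_THRESHOLD are invented names, zlib stands in
for whatever compressor is finally chosen, and none of this is actual
backend code): when the tuple body exceeds a threshold, store it
compressed and set a spare header flag bit so readers know to expand it.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

#define HEAP_COMPRESSED     0x4000      /* imaginary spare flag bit */
#define COMPRESS_THRESHOLD  512         /* don't bother below this size */

typedef struct
{
    uint16_t    t_infomask;             /* flag bits; one spared for us */
    uint32_t    t_len;                  /* length of data as stored */
    char       *t_data;                 /* tuple body as stored on disk */
} FakeTuple;

/*
 * Store a tuple body, compressing transparently when it is big enough
 * and the compressed form actually comes out smaller.
 */
static int
store_tuple(FakeTuple *tup, const char *data, uint32_t len)
{
    if (len > COMPRESS_THRESHOLD)
    {
        uLongf      clen = compressBound(len);
        char       *buf = malloc(clen);

        if (buf != NULL &&
            compress((Bytef *) buf, &clen,
                     (const Bytef *) data, len) == Z_OK &&
            clen < len)
        {
            tup->t_data = buf;
            tup->t_len = (uint32_t) clen;
            tup->t_infomask |= HEAP_COMPRESSED;
            return 0;
        }
        free(buf);
    }
    /* small tuple, or compression did not pay: store it uncompressed */
    tup->t_data = malloc(len);
    if (tup->t_data == NULL)
        return -1;
    memcpy(tup->t_data, data, len);
    tup->t_len = len;
    return 0;
}

Keeping the compressed form only when it is actually smaller means the
flag bit doubles as an unambiguous "needs decompression" marker for
readers, and pathological data can never make a tuple grow.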

Also, if we do it that way, then it would *automatically* apply to
both regular tuples and LO, because the current LO implementation is
just tuples. (Tatsuo's idea of a non-transaction-controlled LO would
need extra work, of course, if we decide that's a good idea...)

regards, tom lane
