Re: [HACKERS] compression in LO and other fields

From: wieck(at)debis(dot)com (Jan Wieck)
To: zakkr(at)zf(dot)jcu(dot)cz (Karel Zak - Zakkr)
Cc: wieck(at)debis(dot)com, t-ishii(at)sra(dot)co(dot)jp, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] compression in LO and other fields
Date: 1999-11-12 13:58:32
Message-ID: m11mHDk-0003kLC@orion.SAPserv.Hamburg.dsh.de
Lists: pgsql-hackers

Karel Zak - Zakkr wrote:

> On Fri, 12 Nov 1999, Jan Wieck wrote:
>
> > I already made some tests with a type I called 'lztext'
> > locally. Only the input-/output-functions exist so far and
>
> Is this your original implementation, or do you use existing
> compression code? I tried bzip2, but the output of that algorithm
> is pure binary; I don't know how to use that in PgSQL when all the
> backend (in/out) routines use char* (yes, I'm a newbie at PgSQL
> hacking :-).

The internal storage format is based on an article I found
at:

http://www.neutralzone.org/home/faqsys/docs/slz_art.txt

Simple Compression using an LZ buffer
Part 3 Revision 1.d:
An introduction to compression on the Amiga by Adisak Pochanayon

Freely Distributable as long as reproduced completely.
Copyright 1993 Adisak Pochanayon

I've written the code from scratch.

The internal representation is binary, for sure. It's a
PostgreSQL variable length data format as usual.

I don't know if there's a compression library available that
fits our needs. First, and most important, it must have a
license that permits us to include it in the distribution
under our existing license. Second, its implementation must
not cause any problems in the backend, such as memory
leaks.

> > The compression rates aren't that gigantic. I've got 30-50%
>
> Isn't it a problem that your implementation compresses all the
> data at once? Typically compression uses a stream, compressing
> only a small buffer in each cycle.

No, that's no problem. On type input, the original value is
completely in memory given as a char*, and the internal
representation is returned as a palloc()'d Datum. For output
it's vice versa.

O.K. some details on the compression rate. I've used 112
.html files with a total size of 1188346 bytes this time.
The smallest one was 131 bytes, the largest one 114549 bytes
and most of the files are somewhere between 3-12K.

Compression results on the binary level are:

gzip -9 outputs 398180 bytes (66.5% rate)

gzip -1 outputs 447597 bytes (62.3% rate)

my code outputs 529420 bytes (55.4% rate)

HTML input might be close to optimal for Adisak's storage
format, but considering that my source implementing the
type input and output functions is under 600 lines, I think
an 11% difference from gzip -9 is a good result anyway.

> > Sorry for the compression specific slang here. Well, anyone
> > interested in the code?
>
> Yes, for me - I'm finishing the Oracle-compatible
> to_char()/to_data() routines (Thomas, are you still quiet?),
> and this is a new appeal for me :-)

Bruce suggested the contrib area, but I'm not sure that's
the right place. If it goes into the distribution at all,
I'd like to use this data type for rule plan strings and
function source text in the system catalogs. I don't expect
we'll have a general solution for tuples split across
multiple blocks in v7.0, and using lztext for rules and
function sources would lower some FRPs. But using it in the
catalogs requires it to be builtin.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#========================================= wieck(at)debis(dot)com (Jan Wieck) #

