Re: [HACKERS] compression in LO and other fields

From: wieck(at)debis(dot)com (Jan Wieck)
To: tgl(at)sss(dot)pgh(dot)pa(dot)us (Tom Lane)
Cc: wieck(at)debis(dot)com, zakkr(at)zf(dot)jcu(dot)cz, t-ishii(at)sra(dot)co(dot)jp, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] compression in LO and other fields
Date: 1999-11-12 15:41:10
Message-ID: m11mIp4-0003kLC@orion.SAPserv.Hamburg.dsh.de
Lists: pgsql-hackers

Tom Lane wrote:

> wieck(at)debis(dot)com (Jan Wieck) writes:
>
> > But it requires decompression of every tuple into palloc()'d
> > memory during heap access. AFAIK, the heap access routines
> > currently return a pointer to the tuple inside the shm
> > buffer. Don't know what its performance impact would be.
>
> Good point, but the same will be needed when a tuple is split across
> multiple blocks. I would expect that (given a reasonably fast
> decompressor) there will be a net performance *gain* due to having
> less disk I/O to do. Also, this won't be happening for "every" tuple,
> just those exceeding a size threshold --- we'd be able to tune the
> threshold value to trade off speed and space.

Right, this time it's your good point. All of these problems
will come up again when we implement tuple splitting.

The major problem I see is that a palloc()'d tuple has to be
pfree()'d after the fetcher is done with it. Since tuples
currently live in the shared buffers, the fetcher doesn't have
to care about that today.
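
To illustrate what I mean (just a sketch, none of these
routines exist - the names are made up):

    /*
     * Sketch only - fetch_tuple(), tuple_is_compressed() and friends
     * are invented names, not existing backend routines.
     */
    char       *tup = fetch_tuple(buffer, offset);  /* points into shm buffer */

    if (tuple_is_compressed(tup))
    {
        /* rawsize is kept in the compression header (see below) */
        char       *plain = palloc(tuple_rawsize(tup));

        decompress_tuple(tup, plain);
        tup = plain;        /* the caller now owns this and must pfree() it */
    }

    /* ... use tup; if it still points into the buffer it must NOT be freed */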

> One thing that does occur to me is that we need to store the
> uncompressed as well as the compressed data size, so that the
> working space can be palloc'd before starting the decompression.

Yepp - and I'm doing so. Only during compression the result
size isn't known in advance. But there is a well-known
maximum: the header overhead plus the data size times 1.125
plus 2 bytes (the total worst case for incompressible data).
And a general mechanism working at the tuple level would fall
back to storing the data uncompressed whenever the compressed
result comes out bigger.
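
In code that would look about like this (again a sketch with
invented names; the 1.125 factor and the 2 byte slack are just
my compressor's worst case as described above):

    /*
     * Sketch only - my_compress(), store_compressed() etc. are invented.
     * Worst case output for incompressible input:
     * header overhead + rawsize * 1.125 + 2 bytes.
     */
    #define COMPRESS_HDRSZ  8   /* assumed header overhead, not the real value */

    Size        maxsize = COMPRESS_HDRSZ + rawsize + (rawsize + 7) / 8 + 2;
    char       *buf     = palloc(maxsize);
    Size        complen = my_compress(data, rawsize, buf);

    if (complen >= rawsize)
        store_uncompressed(data, rawsize);       /* fallback: didn't pay off */
    else
        store_compressed(buf, complen, rawsize); /* keep both sizes in header */
    pfree(buf);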

> Also, in case it wasn't clear, I was envisioning leaving the tuple
> header uncompressed, so that time quals etc can be checked before
> decompressing the tuple data.

Of course.
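
So something along these lines would sit on the page (just a
sketch, an invented struct, not a real header definition):

    /*
     * Sketch of the intended layout: the fixed header stays
     * uncompressed so xmin/xmax/infomask can be checked for time
     * quals, and both sizes are kept so the decompression buffer
     * can be palloc()'d up front.
     */
    typedef struct CompressedTupleData
    {
        HeapTupleHeaderData t_hdr;      /* plain, uncompressed tuple header  */
        int32               rawsize;    /* uncompressed size of the data     */
        int32               complen;    /* compressed size of the data       */
        char                data[1];    /* compressed attribute data follows */
    } CompressedTupleData;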

Well, you asked for the rates on the smaller html files only.
78 files, 131 bytes min, 10000 bytes max, 4582 bytes avg,
357383 bytes total.

gzip -9 outputs 145659 bytes (59.2% saved)
gzip -1 outputs 155113 bytes (56.6% saved)
my code outputs 184109 bytes (48.5% saved)

67 files, 2000 bytes min, 10000 bytes max, 5239 bytes avg,
351006 bytes total.

gzip -9 outputs 141772 bytes (59.6% saved)
gzip -1 outputs 151150 bytes (56.9% saved)
my code outputs 179428 bytes (48.9% saved)

The threshold will surely be a tuning parameter of interest.
Another tuning option must be to allow or deny compression
per table altogether. Then we could have both: a compressing
field type to define which portion of a tuple to compress, or
the option to compress entire tuples.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================== wieck(at)debis(dot)com (Jan Wieck) #
