Re: [HACKERS] LONG

From: wieck(at)debis(dot)com (Jan Wieck)
To: tgl(at)sss(dot)pgh(dot)pa(dot)us (Tom Lane)
Cc: pgman(at)candle(dot)pha(dot)pa(dot)us, wieck(at)debis(dot)com, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] LONG
Date: 1999-12-12 20:45:56
Message-ID: m11xFsS-0003kGC@orion.SAPserv.Hamburg.dsh.de
Lists: pgsql-hackers

Tom Lane wrote:

> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Also, my idea was to auto-enable longs for all varlena types, so short
> > values stay in the table, while longer chained ones that take up lots of
> > space and are expensive to expand are retrieved only when needed.
>
> I missed most of yesterday's discussion (was off fighting a different
> fire...). This morning in the shower I had a brilliant idea, which
> I now see Bruce has beaten me to ;-)
>
> The idea of doing tuple splitting by pushing "long" fields out of line,
> rather than just cutting up the tuple at arbitrary points, is clearly
> a win for the reasons Bruce and Jan point out. But I like Bruce's
> approach (automatically do it for any overly-long varlena attribute)
> much better than Jan's (invent a special LONG datatype). A special
> datatype is bad for several reasons:
> * it forces users to kluge up their database schemas;
> * inevitably, users will pick the wrong columns to make LONG (it's
> a truism that programmers seldom guess right about what parts of
> their programs consume the most resources; users would need a
> "profiler" to make the right decisions);
> * it doesn't solve the problems for arrays, which desperately need it;
> * we'd need to add a whole bunch of operations on the special datatype;

O.K.,

you two got me now.

>
> I think that the right place to implement this is in heapam, and that
> it should go more or less like this:
>
> 1. While writing out a tuple, if the total tuple size is "too big"
> (threshold would be some fraction of BLCKSZ, yet to be chosen),
> then the tuple manager would go through the tuple to find the longest
> varlena attribute, and convert same into an out-of-line attribute.
> Repeat if necessary until tuple size fits within threshold.

Yepp. But it must NOT mangle up the tuple handed to it in
place. The flat values in the tuple are sometimes used AFTER
heap_insert() and heap_update(), for example for
index_insert, so modifying the tuple in place might break
other places.
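
Roughly like this (a sketch only - TUPLE_LONG_THRESHOLD and
move_longest_varlena_out_of_line() are made up names,
heap_copytuple() is the existing function):

    /*
     * Sketch: split on a copy, never on the caller's tuple, so
     * the flat values stay usable for index_insert afterwards.
     */
    static HeapTuple
    heap_long_store(Relation rel, HeapTuple tup)
    {
        if (tup->t_len <= TUPLE_LONG_THRESHOLD)
            return tup;             /* short enough, store as is */

        tup = heap_copytuple(tup);  /* leave the original alone */
        while (tup->t_len > TUPLE_LONG_THRESHOLD)
            move_longest_varlena_out_of_line(rel, tup);
        return tup;
    }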

> 2. While reading a tuple, fastgetattr() automatically fetches the
> out-of-line value if it sees the requested attribute is out-of-line.
> (I'd be inclined to mark out-of-line attributes in the same way that
> NULL attributes are marked: one bit in the tuple header shows if any
> out-of-line attrs are present, and if so there is a bitmap to show
> which ones are out-of-line. We could also use Bruce's idea of
> commandeering the high-order bit of the varlena length word, but
> I think that's a much uglier and more fragile solution.)
>
> I think that these two changes would handle 99% of the problem.
> VACUUM would still need work, but most normal access to tuples would
> just work automatically, because all access to varlena fields must go
> through fastgetattr().

And I like Bruce's idea with the high order bit of vl_len.
It is IMHO the only chance to tell, on UPDATE, whether the
value was changed at all.

To detect that an UPDATE did not touch the out of line
value, the complete long reference information must be
present in the RESULT tuple. So the executor must not expand
the value while building up the result tuples.
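
What I have in mind for the vl_len bit is about this (the
flag name is made up, struct varlena is the existing one):

    /*
     * Sketch: mark a varlena as an out of line reference by
     * setting the high order bit of its length word.
     */
    #define VARATT_FLAG_LONG   0x80000000
    #define VARATT_IS_LONG(p) \
        (((struct varlena *)(p))->vl_len & VARATT_FLAG_LONG)
    #define VARATT_RAWSIZE(p) \
        (((struct varlena *)(p))->vl_len & ~VARATT_FLAG_LONG)

If the bit is set and the reference bytes in the result
tuple are identical to those in the old tuple, the UPDATE
did not touch the value.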

But Tom is right, there is a visibility problem I haven't
seen before. It arises when fetching the out of line
attribute (for example in the type output function) happens
later than fetching the reference information. A transaction
reading dirty or read committed might then see wrong
content, or worse, different contents on different fetches.

The solution I see is to give every out of line datum its
own Oid, which is part of its header and stamped into the
reference data. That way, the long attribute lookup can use
SnapshotAny together with this Oid. Only one datum with that
Oid can ever exist, so SnapshotAny is safe here and ensures
that only the visibility of the master tuple in the main
table counts at all.
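
So the reference stored in the main tuple would look
something like this (layout for illustration only):

    /*
     * Sketch of the long value reference -- field names and
     * layout are illustration only.  The value Oid is never
     * reused for different content, which is what makes the
     * SnapshotAny lookup safe.
     */
    typedef struct LongRef
    {
        int32   vl_len;     /* high bit set: is a reference  */
        Oid     value_oid;  /* Oid in the datum's header     */
        Oid     long_relid; /* relation holding the chunks   */
        int32   raw_size;   /* size of the expanded value    */
    } LongRef;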

Since this value Oid is known in the reference stored in the
tuple, we only need two indices on the out of line data: one
on this Oid, and one on the referencing row's
oid|attrno|seq to be fast in heap_delete() and
heap_update().
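
The central lookup on the first index is then trivial (all
names except SnapshotAny are made up):

    /*
     * Sketch: fetch an out of line value back.  Scans the long
     * relation's value Oid index with SnapshotAny and returns
     * the chunks concatenated in seq order as one palloc()'d
     * varlena.  long_scan_and_concat() is hypothetical.
     */
    struct varlena *
    long_fetch(LongRef *ref)
    {
        return long_scan_and_concat(ref->long_relid,
                                    ref->value_oid,
                                    SnapshotAny);
    }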

> An as-yet-unsolved issue is how to avoid memory leaks of out-of-line
> values after they have been read in by fastgetattr(). However, I think
> that's going to be a nasty problem with Jan's approach as well. The
> best answer might be to solve this in combination with addressing the
> problem of leakage of temporary results during expression evaluation,
> say by adding some kind of reference-count convention to all varlena
> values.

At the point where we decide to move an attribute out of the
tuple, we do a lookup in an array of type Oids. That way we
have plenty of time to enable one datatype after another for
long processing, but can get the first ones out of the door
ASAP (next release).
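
The check itself is trivial (the initial list is just an
example):

    /*
     * Sketch: types enabled for long storage, extended one
     * type at a time over the coming releases.
     */
    static const Oid long_enabled_types[] = { TEXTOID, BYTEAOID };

    static bool
    type_long_enabled(Oid typid)
    {
        int     i;

        for (i = 0; i < sizeof(long_enabled_types) / sizeof(Oid); i++)
            if (long_enabled_types[i] == typid)
                return true;
        return false;
    }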

As Bruce suggested, we implement a central function that
fetches back the long value. It is used in all the type
specific functions in adt. Now that we have an Oid
identifying each single value, it's easy to implement a
cache there that manages an LRU table of the last fetched
values and keeps the smaller ones around for fast access.
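
A first cut of that cache could be as simple as this (slot
count and size limit are pure guesses):

    /*
     * Sketch of the value cache -- all names and numbers
     * invented.  Values up to LONG_CACHE_MAXVAL bytes are kept
     * in a small LRU table keyed by their value Oid; larger
     * ones are always refetched.
     */
    #define LONG_CACHE_SLOTS   64
    #define LONG_CACHE_MAXVAL  4096

    typedef struct LongCacheEntry
    {
        Oid             value_oid;  /* datum held in this slot */
        struct varlena *value;      /* palloc()'d copy         */
        int             lru_count;  /* bumped on every hit     */
    } LongCacheEntry;

    static LongCacheEntry long_cache[LONG_CACHE_SLOTS];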

It's the responsibility of the type's adt functions to free
the returned (old VARLENA looking) memory. Since we enable
the types one by one, there's no need to hurry on this.
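
For example (long_fetch() and VARATT_IS_LONG from the
sketches above, textlen picked as a simple victim):

    /*
     * Example only: an adt function expanding a long value via
     * the central fetch and freeing the copy itself.
     */
    static int32
    long_aware_textlen(text *t)
    {
        int32   result;

        if (VARATT_IS_LONG(t))
        {
            text   *tmp = (text *) long_fetch((LongRef *) t);

            result = VARSIZE(tmp) - VARHDRSZ;
            pfree(tmp);     /* our job to free what we fetched */
        }
        else
            result = VARSIZE(t) - VARHDRSZ;

        return result;
    }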

> BTW, I don't see any really good reason to keep the out-of-line values
> in a separate physical file (relation) as Jan originally proposed.
> Why not keep them in the same file, but mark them as being something
> different than a normal tuple? Sequential scans would have to know to
> skip over them (big deal), and VACUUM would have to handle them

The one reason I see is that a sequential scan would not
benefit then; it would still have to read the entire
relation, even if it only looks at small, fixed size items
in the tuple. A separate relation will be a big win for
count(*). And with the mentioned value cache for relatively
small values (yet to define what "small" is), there will be
very little overhead in a sort, even if the tuples are
sorted by an attribute where some long values occasionally
appear.

> properly, but I think VACUUM is going to have to have special code to
> support this feature no matter what. If we do make them a new primitive
> kind-of-a-tuple on disk, we could sidestep the problem of marking all
> the out-of-line values associated with a tuple when the tuple is
> outdated by a transaction. The out-of-line values wouldn't have
> transaction IDs in them at all; they'd just be labeled with the CTID
> and/or OID of the primary tuple they belong to. VACUUM would consult
> that tuple to determine whether to keep or discard an out-of-line value.

AFAIK, VACUUM consults single attributes of a tuple only to
produce the statistical information for them on ANALYZE.
Well, statistics for columns containing LONG values aren't
of much use in the WHERE clause anyway (I think we all agree
on that). So it doesn't matter if that information isn't
totally accurate, or if VACUUM counts the values but uses
only the first couple of bytes for the min/max etc. info.

Also, the new long data relations should IMHO have their own
relkind, so VACUUM can easily detect them. I think this is
required so that VACUUM can place an exclusive lock on the
main table first, before starting to vacuum the long values
(which can be vacuumed as is, since a long data relation is
in fact a normal relation - just not visible to the user).
This should avoid race conditions like the visibility
problem explained above.
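
Something like this, inside VACUUM's loop over the relations
(the letter is an arbitrary pick):

    /*
     * Sketch: own relkind for long value relations.  VACUUM
     * skips them in its normal pass and handles each one right
     * after locking its master table.
     */
    #define RELKIND_LONG_VALUE  'l'

    if (rel->rd_rel->relkind == RELKIND_LONG_VALUE)
        continue;   /* vacuumed together with its master table */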

I'll start to play around with this approach for a while,
using lztext as test candidate (with custom compression
parameters that force uncompressed storage). When I have
some reasonable result ready to look at, I'll send a patch
here, so we can continue the discussion while looking at a
test implementation.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#========================================= wieck(at)debis(dot)com (Jan Wieck) #
