Re: tableam vs. TOAST

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Prabhat Sahu <prabhat(dot)sahu(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: tableam vs. TOAST
Date: 2019-08-05 19:36:59
Message-ID: CA+TgmoZFUK3M9ShVuphVT16t=d3TpUP7d9axo-e9DjJe70WBSQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Aug 2, 2019 at 6:42 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hm, those all include writing, right? And for read-only we don't expect
> any additional overhead, correct? The write overhead is probably too
> large show a bit of function call overhead - but if so, it'd probably be
> on unlogged tables? And with COPY, because that utilizes multi_insert,
> which means more toasting in a shorter amount of time?

Yes and yes. I guess we could test the unlogged case and with COPY,
but in any realistic case you're still looking for a tiny CPU overhead
in a sea of I/O costs. Even if an extremely short COPY on an unlogged
table regresses slightly, we do not normally reject patches that
improve code quality on the grounds that they add function call
overhead in a few places. Code like this hard to optimize and
maintain; as you remarked yourself, there are multiple opportunities
to do this stuff better that are hard to see in the current structure.

> .oO(why does everyone attach attachements out of order? Is that
> a gmail thing?)

Must be.

> I wonder if toasting.c should be moved too?

I mean, we could, but I don't really see a reason. It'd just be
moving it from one fairly-generic place to another, and I'd rather
minimize churn.

> trailing whitespace after "#ifndef DETOAST_H ".

Will fix.

> Hm. This leaves toast_insert_or_update() as a name. That makes it sound
> like it's generic toast code, rather than heap specific?

I'll rename it to heap_toast_insert_or_update(). But I think I'll put
that in 0004 with the other renames.

> It's definitely nice how a lot of repetitive code has been deduplicated.
> Also makes it easier to see how algorithmically expensive
> toast_insert_or_update() is :(.

Yep.

> Shouldn't this condition be the other way round?

I had to fight pretty hard to stop myself from tinkering with the
algorithm -- this can clearly be done better, but I wanted to make it
match the existing structure as far as possible. It also only needs to
be tested once, not on every loop iteration, so if we're going to
start changing things, we should go further than just swapping the
order of the tests. For now I prefer to draw a line in the sand and
change nothing.

> Couldn't most of these be handled via colflags, instead of having
> numerous individual checks? I.e. if you had TOASTCOL_COMPRESSED,
> TOASTCOL_IGNORED, TOASTCOL_MAIN, TOASTFLAG_EXTERNAL, etc, all but the
> size check ought to boil down to a single mask test?

I'm not really seeing how more flags would significantly simplify this
logic, but I might be missing something.

> I wonder if a better prefix wouldn't be toasting_...

I'm open to other votes, but I think it's toast_tuple is more specific
than toasting_ and thus better.

> > +/*
> > + * Information about one column of a tuple being toasted.
> > + *
> > + * NOTE: toast_action[i] can have these values:
> > + * ' ' default handling
> > + * 'p' already processed --- don't touch it
> > + * 'x' incompressible, but OK to move off
> > + *
> > + * NOTE: toast_attr[i].tai_size is only made valid for varlena attributes with
> > + * toast_action[i] different from 'p'.
> > + */
> > +typedef struct
> > +{
> > + struct varlena *tai_oldexternal;
> > + int32 tai_size;
> > + uint8 tai_colflags;
> > +} ToastAttrInfo;
>
> I think that comment is outdated?

Oops. Will fix.

> > +/*
> > + * Flags indicating the overall state of a TOAST operation.
> > + *
> > + * TOAST_NEEDS_DELETE_OLD indicates that one or more old TOAST datums need
> > + * to be deleted.
> > + *
> > + * TOAST_NEEDS_FREE indicates that one or more TOAST values need to be freed.
> > + *
> > + * TOAST_HAS_NULLS indicates that nulls were found in the tuple being toasted.
> > + *
> > + * TOAST_NEEDS_CHANGE indicates that a new tuple needs to built; in other
> > + * words, the toaster did something.
> > + */
> > +#define TOAST_NEEDS_DELETE_OLD 0x0001
> > +#define TOAST_NEEDS_FREE 0x0002
> > +#define TOAST_HAS_NULLS 0x0004
> > +#define TOAST_NEEDS_CHANGE 0x0008
>
> I'd make these enums. They're more often accessible in a debugger...

Ugh, I hate that style. Abusing enums to make flag bits makes my skin
crawl. I always wondered what the appeal was -- I guess now I know.
Blech.

> I'm quite unconvinced that making the chunk size specified by the AM is
> a good idea to do in isolation. We have TOAST_MAX_CHUNK_SIZE in
> pg_control etc. It seems a bit dangerous to let AMs provide the size,
> without being very clear that any change of the value will make data
> inaccessible. It'd be different if the max were only used during
> toasting.

I was actually thinking about proposing that we rip
TOAST_MAX_CHUNK_SIZE out of pg_control. No real effort has been made
here to make this something that users could configure, and I don't
know of a good reason to configure it. It also seems pretty out of
place in a world where there are multiple AMs floating around -- why
should heap, and only heap, be checked there? Granted it does have
some pride of place, but still.

> I think the *size* checks should be weakened so we check:
> 1) After each chunk, whether the already assembled chunks exceed the
> expected size.
> 2) After all chunks have been collected, check that the size is exactly
> what we expect.
>
> I don't think that removes a meaningful amount of error
> checking. Missing tuples etc get detected by the chunk_ids not being
> consecutive. The overall size is still verified.
>
> The obvious problem with this is the slice fetching logic. For slices
> with an offset of 0, it's obviously trivial to implement. For the higher
> slice logic, I'd assume we'd have to fetch the first slice by estimating
> where the start chunk is based on the current suggest chunk size, and
> restarting the scan earlier/later if not accurate. In most cases it'll
> be accurate, so we'd not loose efficiency.

I don't feel entirely convinced that there's any rush to do all of
this right now, and the more I change the harder it is to make sure
that I haven't broken anything. How strongly do you feel about this
stuff?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2019-08-05 19:39:16 Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Previous Message Daniel Migowski 2019-08-05 19:35:13 Re: Adding column "mem_usage" to view pg_prepared_statements