Re: constants for tar header offsets

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: constants for tar header offsets
Date: 2023-04-18 15:53:08
Message-ID: CA+TgmoZniAFyu-UOt-OtbGnVnc00BhGbN6sMjH_tX+XNoj1a_w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Apr 18, 2023 at 11:38 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > We have a few different places in the code where we generate or modify
> > tar headers or just read data out of them. The code in question uses
> > one of my less-favorite programming things: magic numbers. The offsets
> > of the various fields within the tar header are just hard-coded in
> > each relevant place in our code. I think we should clean that up, as
> > in the attached patch.
>
> Generally +1, with a couple of additional thoughts:
>
> 1. Is it worth inventing macros for the values of the file type,
> rather than just writing the comment you did?

Might be.

> 2. The header size is defined as 512 bytes, but this doesn't sum to 512:
>
> + TAR_OFFSET_PREFIX = 345 /* 155 byte string */
>
> Either that field's length is really 167 bytes, or there's some other
> field you didn't document. (It looks like you may have copied "155"
> from an incorrect existing comment?)

According to my research, it is neither of those, e.g. see

https://www.subspacefield.org/~vax/tar_format.html
https://www.ibm.com/docs/en/zos/2.4.0?topic=formats-tar-format-tar-archives
https://wiki.osdev.org/USTAR

I think that what happened is that whoever designed the original tar
format decided on 512 byte blocks. And the header did not take up the
whole block. The USTAR format is an extension of the original format
which uses more of the block, but still not all of it.

> Yeah, this is adding greppability (a good thing!) but little more.
> However, I'm not convinced it's worth doing more. It's not like
> this data structure will change anytime soon.

Right. greppability is a major concern for me here, and also bug
surface. Right now, to use the functions in pgtar.h, you need to know
all the header offsets as well as the format and length of each header
field. This centralizes constants for the header offsets, and at least
provides some centralized documentation of the rest. It's not great,
though, because it seems like there's some risk of someone writing new
code and getting confused about whether the length of a certain field
is 8 or 12, for example. A thicker abstraction layer might be able to
avoid or minimize such risks better than what we have, but I don't
really know how to design it, whereas this seems like an obvious
improvement.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2023-04-18 16:06:09 Re: constants for tar header offsets
Previous Message Tom Lane 2023-04-18 15:51:54 Re: Request for comment on setting binary format output per session