Re: pg_dump directory archive format / parallel pg_dump

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Joachim Wieland <joe(at)mcknight(dot)de>
Cc:	Jaime Casanova <jaime(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: pg_dump directory archive format / parallel pg_dump
Date:	2011-01-20 11:07:32
Message-ID:	4D381774.8010407@enterprisedb.com
Lists:	pgsql-hackers

On 19.01.2011 16:01, Joachim Wieland wrote:
> On Wed, Jan 19, 2011 at 7:47 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>> Here are the latest patches all of them also rebased to current HEAD.
>>> Will update the commitfest app as well.
>>
>> What's the idea of storing the file sizes in the toc file? It looks like
>> it's not used for anything.
>
> It's part of the overall idea to make sure files are not inadvertently
> exchanged between different backups and that a file is not truncated.
> In the future I'd also like to add a checksum to the TOC so that a
> backup can be checked for integrity. This will cost performance but
> with the parallel backup it can be distributed to several processors.

Ok. I'm going to leave out the filesize. I can see some value in that,
and the CRC, but I don't want to add stuff that's not used at this point.

>> It would be nice to have this format match the tar format. At the moment,
>> there's a couple of cosmetic differences:
>>
>> * TOC file is called "TOC", instead of "toc.dat"
>>
>> * blobs TOC file is called "BLOBS.TOC" instead of "blobs.toc"
>>
>> * each blob is stored as "blobs/<oid>.dat", instead of "blob_<oid>.dat"
>
> That can be done easily...
>
>> The only significant difference is that in the directory archive format,
>> each data file has a header in the beginning.
>
>> What are the benefits of the data file header? Would it be better to leave
>> it out, so that the format would be identical to the tar format? You could
>> then just tar up the directory to get a tar archive, or vice versa.
>
> The header is there to identify a file, it contains the header that
> every other pgdump file contains, including the internal version
> number and the unique backup id.
>
> The tar format doesn't support compression so going from one to the
> other would only work for an uncompressed archive and special care
> must be taken to get the order of the tar file right.

Hmm, the tar format doesn't support compression, but it looks like the
file format issue has been thought of already: there's still code there to
add a .gz suffix to compressed files. How about adopting that convention
in the directory format too? That would make an uncompressed directory
archive compatible with the tar format.

That seems pretty attractive anyway, because you can then dump to a
directory, and manually gzip the data files later.
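
A minimal sketch of that "gzip the data files later" step, assuming the
directory format adopts the .gz suffix convention. The function name and
the file names ("toc.dat" for the TOC, ".dat" for data files) are
assumptions made for illustration, not something the patch defines:

```python
import gzip
import shutil
from pathlib import Path


def gzip_data_files(dump_dir, toc_name="toc.dat"):
    """Compress each uncompressed data file in dump_dir to <name>.gz.

    Assumption: data files end in ".dat" and the TOC file (toc_name)
    must stay uncompressed. Returns the names of the files compressed.
    """
    compressed = []
    # sorted() materializes the listing, so files created below are not revisited
    for path in sorted(Path(dump_dir).iterdir()):
        if not path.is_file() or path.suffix != ".dat" or path.name == toc_name:
            continue
        target = path.with_name(path.name + ".gz")
        with open(path, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        path.unlink()  # drop the original only after the .gz is fully written
        compressed.append(path.name)
    return compressed
```

Something like gzip_data_files("/path/to/dump") would then leave the TOC
untouched and replace each data file with its .gz counterpart.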

Now that we have an API for compression in compress_io.c, it probably
wouldn't be very hard to add the missing compression support to the
tar format either.

> If you want to drop the header altogether, fine with me but if it's
> just for the tar<-> directory conversion, then I am failing to see
> what the use case of that would be.
>
> A tar archive has the advantage that you can postprocess the dump data
> with other tools but for this we could also add an option that gives
> you only the data part of a dump file (and uncompresses it at the same
> time if compressed). Once we have that however, the question is what
> anybody would then still want to use the tar format for...

I don't know how popular it'll be in practice, but it seems very nice to
me to be able to do things like run a parallel pg_dump in directory format
first, and then tar the result up into a single file for archival.
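
A rough sketch of the "tar it up" step, assuming the data files carry no
per-file header so the directory contents can be archived as-is. The
function name, the "toc.dat" name, and the choice to write the TOC first
are assumptions; whether a reader of the tar format accepts this member
order is exactly the ordering concern raised above and is not verified
here:

```python
import tarfile
from pathlib import Path


def tar_up_dump(dump_dir, tar_path, toc_name="toc.dat"):
    """Write every regular file in dump_dir into an uncompressed tar.

    Assumption: the TOC file should come first; remaining files follow
    in name order. Returns the member names in the order written.
    """
    files = [p for p in sorted(Path(dump_dir).iterdir()) if p.is_file()]
    # Stable sort: False (the TOC) sorts before True (everything else)
    files.sort(key=lambda p: p.name != toc_name)
    with tarfile.open(str(tar_path), "w") as tar:
        for path in files:
            # arcname keeps only the file name, so members sit at the tar root
            tar.add(str(path), arcname=path.name)
    return [p.name for p in files]
```

The tar file should be written outside the dump directory, for example
tar_up_dump("/path/to/dump", "/path/to/dump.tar").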

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Re: pg_dump directory archive format / parallel pg_dump at 2011-01-19 14:01:46 from Joachim Wieland

Responses

Re: pg_dump directory archive format / parallel pg_dump at 2011-01-20 13:46:28 from Joachim Wieland
