Re: parallel pg_restore

From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Joshua Drake <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: parallel pg_restore
Date: 2008-09-24 19:10:44
Message-ID: 464C4848-FDEC-40A3-90D1-0BE285CA0C46@hi-media.com
Lists: pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 24 Sept. 08 at 18:56, Andrew Dunstan wrote:
> This is purely a patch to pg_restore. No backend changes at all (and
> if I did it would not use anything that isn't in core anyway).

Ok, good.
I'm eager to see what -core hackers will want to do with Postgres-R
patches, but that shouldn't be a reason to distract them, sorry...

> Also, you ignored the point about clustered data. Maybe that doesn't
> matter to some people, but it does to others. This is designed to
> provide the same result as a single threaded pg_restore. Splitting
> data will break that.

I'm not sure I understand what you mean by "clustered data" here, in
fact...

> Having pg_dump do the split would mean you get it for free, pretty
> much. Rejecting that for a solution that could well be a bottleneck
> at restore time would require lots more than just a feeling. I don't
> see how it would give you any less reason to trust your backups.

Well, when pg_restore's COPY fails, the table is not loaded and you get
an ERROR, and if you're running with the -1 option, the restore stops
there and you get a nice ROLLBACK.
With the latter option, even if pg_dump did split your tables, the
ROLLBACK still happens.

Now, what happens when only one part of the data cannot be restored
but you didn't run pg_restore -1? I guess you're simply left with a
partially restored table. How will you know which part contains the
error? How will you replay the restoring of this part only?

If the answer is to play with the restore catalogue, ok; if that's not
it, I feel the dumps are now less trustworthy with the split
option than they were before.
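
To make the concern concrete, here is a minimal sketch of what "know which
part failed, replay only that part" could look like. It is purely
illustrative: split_rows, restore_parts and load_part are hypothetical names,
nothing here is part of pg_dump or pg_restore, and the loader is a stand-in
for a per-part COPY that either loads the whole part or nothing.

```python
from typing import Callable, List, Optional, Sequence, Set, Tuple

Row = Tuple[str, ...]


def split_rows(rows: Sequence[Row], chunk_size: int) -> List[Sequence[Row]]:
    """Cut a table's rows into consecutive parts of at most chunk_size rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def restore_parts(
    parts: Sequence[Sequence[Row]],
    load_part: Callable[[Sequence[Row]], None],
    only: Optional[Set[int]] = None,
) -> List[int]:
    """Load each part on its own and return the indexes of the failed parts.

    load_part is assumed to be all-or-nothing (like one COPY per part) and
    to raise ValueError on bad data. Pass `only` to replay selected parts.
    """
    failed: List[int] = []
    for index, part in enumerate(parts):
        if only is not None and index not in only:
            continue
        try:
            load_part(part)
        except ValueError:
            # Remember the part instead of aborting the whole restore.
            failed.append(index)
    return failed


if __name__ == "__main__":
    table: List[Row] = []

    def load_part(part: Sequence[Row]) -> None:
        # Hypothetical loader: reject the whole part on one malformed row.
        if any(len(row) != 2 for row in part):
            raise ValueError("malformed row")
        table.extend(part)

    rows: List[Row] = [("1", "a"), ("2", "b"), ("3",), ("4", "d"), ("5", "e")]
    failed = restore_parts(split_rows(rows, 2), load_part)
    print("failed parts:", failed)
```

Without something like the returned list of failed parts, a non -1 restore of
split data leaves you with a partially loaded table and no record of where
the hole is.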

Of course all this remains hypothetical, as your work does not include
such a feature, which as we see is yet to be designed.

> I still think the multiple data members of the archive approach
> would be best here. One that allowed you to tell pg_dump to split
> every nn rows, or every nn megabytes. Quite apart from any
> parallelism issues, that could help enormously when there is a data
> problem as happens from time to time, and can get quite annoying if
> it's in the middle of a humungous data load.

Agreed, but it depends a lot on the ways to control the part that
failed, IMVHO. And I think we'd prefer to have a version of COPY FROM
with the capability to continue loading on failure...
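
As a rough illustration of "continue loading on failure", here is a minimal
client-side sketch. It is not an existing COPY option: copy_with_rejects and
insert_row are hypothetical names, and insert_row stands in for whatever
loads a single row and raises ValueError when the row is bad.

```python
from typing import Callable, Iterable, List, Tuple

Reject = Tuple[int, str, str]  # (line number, raw line, error message)


def copy_with_rejects(
    lines: Iterable[str],
    insert_row: Callable[[str], None],
) -> Tuple[int, List[Reject]]:
    """Load every line it can; collect the bad ones instead of stopping."""
    loaded = 0
    rejects: List[Reject] = []
    for line_number, line in enumerate(lines, start=1):
        try:
            insert_row(line)
        except ValueError as exc:
            # Keep enough context to fix and replay just this line later.
            rejects.append((line_number, line, str(exc)))
        else:
            loaded += 1
    return loaded, rejects


if __name__ == "__main__":
    table: List[Tuple[int, str]] = []

    def insert_row(line: str) -> None:
        # Hypothetical row loader: expects "<integer id>\t<name>".
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError("expected 2 columns")
        table.append((int(fields[0]), fields[1]))

    loaded, rejects = copy_with_rejects(["1\ta", "x\tb", "3\tc"], insert_row)
    print(loaded, "rows loaded,", len(rejects), "rejected")
```

The reject list is the part that matters for the discussion above: it is the
control over "the part that failed" that a plain split does not give you.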

Regards,
- --
dim

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Darwin)

iEYEARECAAYFAkjakLQACgkQlBXRlnbh1bm4jgCg0WenIOsaHwD9GDpI6C2mhVYB
pdwAoJYesvDYByQbSxqMjIEZOR9KiVXu
=AVy3
-----END PGP SIGNATURE-----
