Re: where should I stick that backup?

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Stephen Frost <sfrost(at)snowman(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: where should I stick that backup?
Date:	2020-04-17 16:19:32
Message-ID:	CA+Tgmoa=fYTLHahw7dbMnvRTROY45c-TMN3P5-XZZYAww76oZg@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Apr 16, 2020 at 10:22 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Hmm. Could we learn what we need to know about this by doing something
> like taking a basebackup of a cluster with some data in it (say, created
> by pgbench -i -s 400 or something) and then comparing the speed of cat
> < base.tar | gzip > base.tgz to the speed of gzip < base.tar >
> base.tgz? It seems like there's no difference between those except
> that the first one relays through an extra process and an extra pipe.

I decided to try this. First I experimented on my laptop using a
backup of a pristine pgbench database, scale factor 100, ~1.5GB.

[rhaas pgbackup]$ for i in 1 2 3; do
    echo "= run number $i = "
    sync; sync; time gzip < base.tar > base.tar.gz; rm -f base.tar.gz
    sync; sync; time cat < base.tar | gzip > base.tar.gz; rm -f base.tar.gz
    sync; sync; time cat < base.tar | cat | cat | gzip > base.tar.gz; rm -f base.tar.gz
done

= run number 1 =
real 0m24.011s
user 0m23.542s
sys 0m0.408s

real 0m23.623s
user 0m23.447s
sys 0m0.908s

real 0m23.688s
user 0m23.847s
sys 0m2.085s
= run number 2 =

real 0m23.704s
user 0m23.290s
sys 0m0.374s

real 0m23.389s
user 0m23.239s
sys 0m0.879s

real 0m23.762s
user 0m23.888s
sys 0m2.057s
= run number 3 =

real 0m23.567s
user 0m23.187s
sys 0m0.361s

real 0m23.573s
user 0m23.422s
sys 0m0.903s

real 0m23.749s
user 0m23.884s
sys 0m2.113s

It looks like piping everything through an extra copy of 'cat' may
even be *faster* than having the process read it directly; two out of
three runs with the extra "cat" finished very slightly quicker than
the test where gzip read the file directly. The third set of numbers
for each test run is with three copies of "cat" interposed. That
appears to be slower than with no extra pipes, but not very much, and
it might just be noise.

Next I tried it out on Linux. For this I used 'cthulhu', an older box
with lots and lots of memory and cores. Here I took the scale factor
up to 400, so it's about 5.9GB of data. Same command as above produced
these results:

= run number 1 =

real 2m35.797s
user 2m30.990s
sys 0m4.760s

real 2m35.407s
user 2m32.730s
sys 0m16.714s

real 2m40.598s
user 2m39.054s
sys 0m37.596s
= run number 2 =

real 2m35.529s
user 2m30.971s
sys 0m4.510s

real 2m33.933s
user 2m31.685s
sys 0m16.003s

real 2m45.563s
user 2m44.042s
sys 0m40.357s
= run number 3 =

real 2m35.876s
user 2m31.437s
sys 0m4.391s

real 2m33.872s
user 2m31.629s
sys 0m16.266s

real 2m40.836s
user 2m39.359s
sys 0m38.960s

These results are pretty similar to the macOS results. The overall
performance was worse, but I think that is probably explained by the
fact that the MacBook has a Haswell-class processor rather than
Westmere, and significantly higher RAM speed. The pattern that
one extra pipe seems to be perhaps slightly faster, and three extra
pipes a tad slower, persists. So at least in this test, the overhead
added by each pipe appears to be <1%, which I would classify as good
enough not to worry too much about.
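
For anyone who wants to check the arithmetic, here is a minimal sketch
that turns the "real" times from run number 1 on the Linux box into a
relative overhead against the plain gzip baseline. The helper names
to_seconds and overhead_pct are made up for this illustration; they are
not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Convert a `time`-style value such as "2m35.797s" into seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        split(t, p, "m")          # p[1] = minutes, p[2] = seconds plus "s"
        sub(/s$/, "", p[2])       # drop the trailing "s"
        printf "%.3f\n", p[1] * 60 + p[2]
    }'
}

# Percentage difference of a candidate time relative to a baseline time.
# Negative means the candidate was faster than the baseline.
overhead_pct() {
    awk -v base="$1" -v cand="$2" 'BEGIN {
        printf "%.2f\n", (cand - base) / base * 100
    }'
}

# "real" times from run number 1 on the Linux machine above.
base=$(to_seconds 2m35.797s)        # gzip < base.tar
one_cat=$(to_seconds 2m35.407s)     # cat < base.tar | gzip
three_cat=$(to_seconds 2m40.598s)   # cat < base.tar | cat | cat | gzip

echo "one extra cat:    $(overhead_pct "$base" "$one_cat")%"
echo "three extra cats: $(overhead_pct "$base" "$three_cat")%"
```

For that run this prints -0.25% for one extra cat and 3.08% for three,
so the three-pipe case works out to roughly 1% per pipe in wall-clock
time. The other runs can be plugged in the same way; with only three
runs per machine, differences of this size are hard to tell apart from
noise.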

> I don't know exactly how to do the equivalent of this on Windows, but
> I bet somebody does.

However, I still don't know what the situation is on Windows. I did do
some searching around on the Internet to try to find out whether pipes
being slow on Windows is a generally-known phenomenon, and I didn't
find anything very compelling, but I don't have an environment set up
to test it myself.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: where should I stick that backup? at 2020-04-17 02:22:38 from Robert Haas

Responses

Re: where should I stick that backup? at 2020-04-17 23:44:08 from Andres Freund
