From: "Bossart, Nathan" <bossartn(at)amazon(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallelizing the archiver
Date: 2021-09-10 17:06:59
Message-ID: 4DCB0B93-EA51-405F-8542-423F37CB8A72@amazon.com
Lists: pgsql-hackers
On 9/10/21, 8:22 AM, "Robert Haas" <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Sep 10, 2021 at 10:19 AM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
>> Those approaches don't really seem mutually exclusive? In both cases
>> you will need to internally track the status of each WAL file and
>> handle non-contiguous file sequences. In the case of parallel commands
>> you only need the additional knowledge that some command is already
>> working on a file. Wouldn't it be even better to eventually be able
>> to launch multiple batches of multiple files rather than a single
>> batch?
>
> Well, I guess I'm not convinced. Perhaps people with more knowledge of
> this than I may already know why it's beneficial, but in my experience
> commands like 'cp' and 'scp' are usually limited by the speed of I/O,
> not the fact that you only have one of them running at once. Running
> several at once, again in my experience, is typically not much faster.
> On the other hand, scp has a LOT of startup overhead, so it's easy to
> see the benefits of batching.
>
> [...]
>
>> If we start with parallelism first, the whole ecosystem could
>> immediately benefit from it as is. To be able to handle multiple
>> files in a single command, we would need some way to let the server
>> know which files were successfully archived and which files weren't,
>> so it requires a different communication approach than the command
>> return code.
>
> That is possibly true. I think it might work to just assume that you
> have to retry everything if it exits non-zero, but that requires the
> archive command to be smart enough to do something sensible if an
> identical file is already present in the archive.
My initial thinking was similar to Julien's. Assuming I have an
archive_command that handles one file, I can just set
archive_max_workers to 3 and reap the benefits. If I'm using an
existing utility that implements its own parallelism, I can keep
archive_max_workers at 1 and continue using it. This would be a
simple incremental improvement.
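To make that concrete, here is roughly the configuration I have in mind
(archive_max_workers is of course hypothetical at this point, and the
parallel utility name below is just a placeholder):

    # single-file archive_command; the server runs up to three copies
    # of it concurrently
    archive_command = 'cp %p /mnt/archive/%f'
    archive_max_workers = 3

    # alternative: an external tool that parallelizes internally
    # (placeholder name), so keep the server at a single worker
    #archive_command = 'my_archiver --parallel=4 push %p'
    #archive_max_workers = 1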
That being said, I think the discussion about batching is a good one
to have. If the overhead described in your SCP example is
representative of a typical archive_command, then parallelism does
seem a bit silly. We'd essentially be using a ton more resources when
there's obvious room for improvement via reducing the amount of overhead
per archive. I think we could easily make the batch size configurable
so that existing archive commands would work (e.g.,
archive_batch_size=1). However, unlike the simple parallel approach,
you'd likely have to adjust your archive_command if you wanted to make
use of batching. That doesn't seem terrible to me, though. As
discussed above, there are some implementation details to work out for
archive failures, but nothing about that seems intractable to me.
Plus, if you still wanted to parallelize things, feeding your
archive_command several files at a time could still be helpful.
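To sketch what I mean by adjusting the archive_command for batching:
suppose the server passed the paths of a whole batch of WAL files as
arguments (the exact interface is one of the details to work out). A
batch-aware command might then look something like the following,
written so that retrying the whole batch after a non-zero exit is
harmless:

    #!/bin/sh
    # Hypothetical batch-aware archive command.  Each argument is the
    # path of a WAL file to archive; on a non-zero exit the server would
    # retry the whole batch, so re-archiving an already-copied file must
    # be safe.
    for wal in "$@"; do
        dest="/mnt/archive/$(basename "$wal")"
        # Tolerate an identical file left over from a previous attempt.
        if [ -f "$dest" ] && cmp -s "$wal" "$dest"; then
            continue
        fi
        cp "$wal" "$dest" || exit 1
    done
    exit 0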
I'm currently leaning toward exploring the batching approach first. I
suppose we could always make a prototype of both solutions for
comparison with some "typical" archive commands if that would help
with the discussion.
Nathan