Re: WIP/PoC for parallel backup

From: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
To: Ahsan Hadi <ahsan(dot)hadi(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, Kashif Zeeshan <kashif(dot)zeeshan(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP/PoC for parallel backup
Date: 2020-04-15 09:28:37
Message-ID: CAKcux6nj2T_U+4c6pJVUb8yGmQ-J4B2D3BRErw3iiXgkgCGAnQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Asif,

In below scenarios backup verification failed for tablespace, when backup
taken with parallel option.
without parallel for the same scenario pg_verifybackup is passed without
any error.

[edb(at)localhost bin]$ mkdir /tmp/test_bkp/tblsp1
[edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create tablespace tblsp1
location '/tmp/test_bkp/tblsp1';"
CREATE TABLESPACE
[edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create table test (a text)
tablespace tblsp1;"
CREATE TABLE
[edb(at)localhost bin]$ ./psql postgres -p 5432 -c "insert into test values
('parallel_backup with -T tablespace option');"
INSERT 0 1
[edb(at)localhost bin]$ ./pg_basebackup -p 5432 -D /tmp/test_bkp/bkp -T
/tmp/test_bkp/tblsp1=/tmp/test_bkp/tblsp2 -j 4
[edb(at)localhost bin]$ ./pg_verifybackup /tmp/test_bkp/bkp
pg_verifybackup: error: "pg_tblspc/16384/PG_13_202004074/13530/16390" is
present on disk but not in the manifest
pg_verifybackup: error: "pg_tblspc/16384/PG_13_202004074/13530/16388" is
present on disk but not in the manifest
pg_verifybackup: error: "pg_tblspc/16384/PG_13_202004074/13530/16385" is
present on disk but not in the manifest
pg_verifybackup: error: "/PG_13_202004074/13530/16388" is present in the
manifest but not on disk
pg_verifybackup: error: "/PG_13_202004074/13530/16390" is present in the
manifest but not on disk
pg_verifybackup: error: "/PG_13_202004074/13530/16385" is present in the
manifest but not on disk

--without parallel backup
[edb(at)localhost bin]$ ./pg_basebackup -p 5432 -D /tmp/test_bkp/bkp1 -T
/tmp/test_bkp/tblsp1=/tmp/test_bkp/tblsp3 -j 1
[edb(at)localhost bin]$ ./pg_verifybackup /tmp/test_bkp/bkp1
backup successfully verified

Thanks & Regards,
Rajkumar Raghuwanshi

On Wed, Apr 15, 2020 at 2:19 PM Ahsan Hadi <ahsan(dot)hadi(at)gmail(dot)com> wrote:

>
>
> On Wed, 15 Apr 2020 at 1:49 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> On Tue, Apr 14, 2020 at 10:37 AM Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>
>> wrote:
>> > I forgot to make a check for no-manifest. Fixed. Attached is the
>> updated patch.
>>
>> +typedef struct
>> +{
>> ...
>> +} BackupFile;
>> +
>> +typedef struct
>> +{
>> ...
>> +} BackupState;
>>
>> These structures need comments.
>>
>> +list_wal_files_opt_list:
>> + SCONST SCONST
>> {
>> - $$ = makeDefElem("manifest_checksums",
>> -
>> (Node *)makeString($2), -1);
>> + $$ = list_make2(
>> + makeDefElem("start_wal_location",
>> + (Node *)makeString($2),
>> -1),
>> + makeDefElem("end_wal_location",
>> + (Node *)makeString($2),
>> -1));
>> +
>> }
>>
>> This seems like an unnecessarily complicated parse representation. The
>> DefElems seem to be completely unnecessary here.
>>
>> @@ -998,7 +1110,37 @@ SendBaseBackup(BaseBackupCmd *cmd)
>> set_ps_display(activitymsg);
>> }
>>
>> - perform_base_backup(&opt);
>> + switch (cmd->cmdtag)
>>
>> So the design here is that SendBaseBackup() is now going to do a bunch
>> of things that are NOT sending a base backup? With no updates to the
>> comments of that function and no change to the process title it sets?
>>
>> - return (manifest->buffile != NULL);
>> + return (manifest && manifest->buffile != NULL);
>>
>> Heck no. It appears that you didn't even bother reading the function
>> header comment.
>>
>> + * Send a single resultset containing XLogRecPtr record (in text format)
>> + * TimelineID and backup label.
>> */
>> static void
>> -SendXlogRecPtrResult(XLogRecPtr ptr, TimeLineID tli)
>> +SendXlogRecPtrResult(XLogRecPtr ptr, TimeLineID tli,
>> + StringInfo label, char *backupid)
>>
>> This just casually breaks wire protocol compatibility, which seems
>> completely unacceptable.
>>
>> + if (strlen(opt->tablespace) > 0)
>> + sendTablespace(opt->tablespace, NULL, true, NULL, &files);
>> + else
>> + sendDir(".", 1, true, NIL, true, NULL, NULL, &files);
>> +
>> + SendFilesHeader(files);
>>
>> So I guess the idea here is that we buffer the entire list of files in
>> memory, regardless of size, and then we send it out afterwards. That
>> doesn't seem like a good idea. The list of files might be very large.
>> We probably need some code refactoring here rather than just piling
>> more and more different responsibilities onto sendTablespace() and
>> sendDir().
>>
>> + if (state->parallel_mode)
>> + SpinLockAcquire(&state->lock);
>> +
>> + state->throttling_counter += increment;
>> +
>> + if (state->parallel_mode)
>> + SpinLockRelease(&state->lock);
>>
>> I don't like this much. It seems to me that we would do better to use
>> atomics here all the time, instead of conditional spinlocks.
>>
>> +static void
>> +send_file(basebackup_options *opt, char *file, bool missing_ok)
>> ...
>> + if (file == NULL)
>> + return;
>>
>> That seems totally inappropriate.
>>
>> + sendFile(file, file + basepathlen, &statbuf,
>> true, InvalidOid, NULL, NULL);
>>
>> Maybe I'm misunderstanding, but this looks like it's going to write a
>> tar header, even though we're not writing a tarfile.
>>
>> + else
>> + ereport(WARNING,
>> + (errmsg("skipping special file
>> or directory \"%s\"", file)));
>>
>> So, if the user asks for a directory or symlink, what's going to
>> happen is that they're going to receive an empty file, and get a
>> warning. That sounds like terrible behavior.
>>
>> + /*
>> + * Check for checksum failures. If there are failures across
>> multiple
>> + * processes it may not report total checksum count, but it will
>> error
>> + * out,terminating the backup.
>> + */
>>
>> In other words, the patch breaks the feature. Not that the feature in
>> question works particularly well as things stand, but this makes it
>> worse.
>>
>> I think this patch (0003) is in really bad shape. I'm having second
>> thoughts about the design, but it's kind of hard to even have a
>> discussion about the design when the patch is riddled with minor
>> problems like inadequate comments, failure to update existing
>> comments, and breaking a bunch of things. I understand that sometimes
>> things get missed, but this is version 14 of a patch that's been
>> kicking around since last August.
>
>
> Fair enough. Some of this is also due to backup related features i.e
> backup manifest, progress reporting that got committed to master towards
> the tail end of PG-13. Rushing to get parallel backup feature compatible
> with these features also caused some of the oversights.
>
>
>>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>>
>> --
> Highgo Software (Canada/China/Pakistan)
> URL : http://www.highgo.ca
> ADDR: 10318 WHALLEY BLVD, Surrey, BC
> EMAIL: mailto: ahsan(dot)hadi(at)highgo(dot)ca
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter J. Holzer 2020-04-15 10:01:46 pg_restore: could not close data file: Success
Previous Message Ants Aasma 2020-04-15 09:05:47 Re: Parallel copy