Re: DBMirror.pl performance change

From: Peter Wilson <petew(at)yellowhawk(dot)co(dot)uk>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: DBMirror.pl performance change
Date: 2006-01-23 10:59:37
Message-ID: dr2cun$10r5$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

The Whitebeam implementation of DBMirror.pl :
http://www.whitebeam.org/library/guide/TechNotes/replicate.rhtm

is a complete re-write in 'C' which avoids a lot of the text processing, and what text processing is required is done using a state machine rather
than repeated regular expressions. Before I wrote the 'C' implementation I did look at optimising the Perl version. One of my big concerns was the
time taking escaping and re-escaping the strings. I can't remember the details now but as far as I can remember a lot of that is unnecessary. There
seemed to be an unescape of the data then a re-escape to the target database. In practice the data was in the correct format.

We make quite heavy use of both BYTEA and large varchar fields in our database. I did some load testing at the time and found the new version could
replicate 10s of file objects per second - where the Perl version took 10 minutes to replicate a 120K BYTEA field (both on a slowish machine, but the
Perl version wasn't much better on a fast machine *and* took 97% CPU).

I also took the opportunity to make the 'C' version much more tolerant to lost DB connections without having to restart and added a few other tweaks
to make it more flexible.

It's released under the BSD license now as well

Pete
--
http://www.whitebeam.org
http://www.yellowhawk.co.uk
--------

Achilleus Mantzios wrote:
> I discovered a problem in DBMirror.pl, performance wise.
>
> pending.c stores data in a way
> very similar to the PgSQL input "\" escaped format.
>
> When the field is of type bytea, and the source of data is binary, then
> this produces 2 additional backslashes for every unprintable
> char.
>
> The performance in function extractData in DBMirror.pl, really suffers
> from this condition, since it breaks data in chunks of "\" delimited
> strings.
>
> Informally speaking, performance tends to be O(n) where n is the size
> of the data.
>
> This can be remedied if we break data in chunks of "'" rather than "\".
> "'" happens much more infrequently in common binary files (bz2, tiff, jpg,
> pdf etc..), and if we notice that odd number of contained "\", signals an
> intermidiate "'", whereas even number of "\" signals the final "'",
> then we can make this routine run much faster.
>
> I attach the new extractData function.
>
> Now replicating a 400 k tiff takes 3 seconds instead of 12 minutes
> it used to do.
>
> I am wondering about the state of
> http://www.whitebeam.org/library/guide/TechNotes/replicate.rhtm
>
> Please feel free for any comments.
>
> Pete could you test this new DBMirror.pl, to see how it behaves
> in comparison with your C++ solution?
>
>
>
> ------------------------------------------------------------------------
>
> sub extractData($$) {
> my $pendingResult = $_[0];
> my $currentTuple = $_[1];
> my $fnumber;
> my %valuesHash;
> $fnumber = 4;
> my $dataField = $pendingResult->getvalue($currentTuple,$fnumber);
> my $numofbs;
>
> while(length($dataField)>0) {
> # Extract the field name that is surronded by double quotes
> $dataField =~ m/(\".*?\")/s;
> my $fieldName = $1;
> $dataField = substr $dataField ,length($fieldName);
> $fieldName =~ s/\"//g; #Remove the surronding " signs.
>
> if($dataField =~ m/(^= )/s) {
> #Matched null
> $dataField = substr $dataField , length($1);
> $valuesHash{$fieldName}=undef;
> }
> elsif ($dataField =~ m/(^=\')/s) {
> #Has data.
> my $value;
> $dataField = substr $dataField ,2; #Skip the ='
> LOOP: { #This is to allow us to use last from a do loop.
> #Recommended in perlsyn manpage.
> do {
> my $matchString;
> my $matchString2;
> #Find the substring ending with the first ' or first \
> $dataField =~ m/(.*?[\'])?/s;
> $matchString = $1;
>
> $numofbs = ($matchString =~ tr/\\//) % 2;
>
> if ($numofbs == 1) { #// odd number of \, i.e. intermediate '
> $matchString2 = substr $matchString,0, length($matchString)-2;
> $matchString2 =~ s/\\\\/\\/g;
> $value .= ($matchString2 . "\'");
> $dataField = substr $dataField,length($matchString);
> }
> else { #// even number of \, i.e. found end of data
> $matchString2 = substr $matchString,0, length($matchString)-1;
> $matchString2 =~ s/\\\\/\\/g;
> $value .= $matchString2;
> $dataField = substr $dataField,length($matchString)+1;
> last;
> }
>
>
> } until(length($dataField)==0);
> }
> $valuesHash{$fieldName} = $value;
>
>
> }#else if
> else {
>
> logErrorMessage "Error in PendingData Sequence Id " .
> $pendingResult->getvalue($currentTuple,0);
> die;
> }
>
>
>
> } #while
> return %valuesHash;
>
> }
>
>
> ------------------------------------------------------------------------
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match

In response to

Browse pgsql-general by date

  From Date Subject
Next Message A. Kretschmer 2006-01-23 11:01:29 Re: Postgresql/
Previous Message Janning Vygen 2006-01-23 10:38:28 Postgresql/DBA/Sysadmin Consultant in =?iso-8859-1?q?D=FCsseldorf?=, Germany