Quick Links

Re: md5(large_object_id)

From:	"Daniel Verite" <daniel(at)manitou-mail(dot)org>
To:	"Karsten Hilbert" <Karsten(dot)Hilbert(at)gmx(dot)net>
Cc:	"pgsql-general" <pgsql-general(at)postgresql(dot)org>
Subject:	Re: md5(large_object_id)
Date:	2015-10-06 12:32:19
Message-ID:	c3ddc10b-732d-4023-9964-465910754fe6@mm
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

Karsten Hilbert wrote:

> On Mon, Oct 05, 2015 at 03:27:26PM +0000, Kevin Grittner wrote:
>
> > Karsten Hilbert <Karsten(dot)Hilbert(at)gmx(dot)net> wrote:
> >
> > > I am dealing with radiology studies aka DICOM data) one would
> > > want an md5 function which streams in parts of a large object
> > > piece by piece using md5_update and m5_finalize or some such.
> > It would certainly be possible to write a lo_md5(oid) function to do
> > this, but as far as I'm aware nobody has yet done so. How are your
> > C skills?
>
> Next to none, unfortunately :-/

It would be nice to have this API exposed in pgcrypto,
but as a workaround, you could use plperl instead, or a
similarly "easy" PL that provides it.

Here's an example with plperlu, using the core
Digest module. It works with other hash types, not just MD5.
See http://perldoc.perl.org/Digest.html for the exact names
(warning: 'SHA-1' instead of pgcrypto's 'SHA1').

CREATE OR REPLACE FUNCTION lo_digest(text, oid, int)
RETURNS bytea AS
$$
use Digest;
use strict;
use Data::Dumper;
my $ctxt = Digest->new($_[0]);
my $sz=$_[2];
elog(ERROR, "Invalid chunk size: $sz") if ($sz<=0);
my $sth = spi_query("SELECT lo_open($_[1], 262144) as fd");
my $row = spi_fetchrow($sth);
spi_cursor_close($sth);

if ($row) {
my $fd = $row->{fd};
my $bytes;
my $plan = spi_prepare("SELECT loread($fd, $sz) as chunk");
do {
$sth = spi_query_prepared($plan);
$row = spi_fetchrow($sth);
$bytes = decode_bytea($row->{chunk});
$ctxt->add($bytes);
spi_cursor_close($sth);
} while (length($bytes)>0);
spi_exec_query("select lo_close($fd)");
spi_freeplan($plan);
}
return encode_bytea($ctxt->digest);
$$ LANGUAGE plperlu;

Example of use:
# select lo_digest('MD5', 2557608, 2048*256);

lo_digest
------------------------------------
\xa8447e145d0f8d9ca7fe7df1bbf06d75

2557608 is the oid and 2048*256 represents 512KB.
The lo blocksize is typically 2048, so a multiple of 2048 is ideal
to optimize reads.

I expect it to work on very large contents while requesting only modest
amounts of memory. But unfortunately is seems pretty slow.

On my desktop core i5-3470, it takes 2.5 seconds to digest
a 100 MB audio file already in cache. An independant Perl script
doing the equiavlent processing on the same data takes 0.4s .

It's not because of Digest::MD5 or loread(), it seems that it's the
pipe in-between with the text->bytea decoding that eats most of the
CPU cycles, that step being necessary because plperl lacks the ability
to consume bytea directly. Sigh...

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

In response to

Re: md5(large_object_id) at 2015-10-05 18:44:12 from Karsten Hilbert

Browse pgsql-general by date

	From	Date	Subject
Next Message	Adrian Klaver	2015-10-06 12:49:09	Re: Recording exceptions within function (autonomous transactions?)
Previous Message	Scott Mead	2015-10-06 10:46:16	Re: How to reduce pg_dump backup time