Re: md5(large_object_id)

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Karsten Hilbert" <Karsten(dot)Hilbert(at)gmx(dot)net>
Cc: "pgsql-general" <pgsql-general(at)postgresql(dot)org>
Subject: Re: md5(large_object_id)
Date: 2015-10-06 12:32:19
Message-ID: c3ddc10b-732d-4023-9964-465910754fe6@mm
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Karsten Hilbert wrote:

> On Mon, Oct 05, 2015 at 03:27:26PM +0000, Kevin Grittner wrote:
>
> > Karsten Hilbert <Karsten(dot)Hilbert(at)gmx(dot)net> wrote:
> >
> > > I am dealing with radiology studies aka DICOM data) one would
> > > want an md5 function which streams in parts of a large object
> > > piece by piece using md5_update and m5_finalize or some such.
> > It would certainly be possible to write a lo_md5(oid) function to do
> > this, but as far as I'm aware nobody has yet done so. How are your
> > C skills?
>
> Next to none, unfortunately :-/

It would be nice to have this API exposed in pgcrypto,
but as a workaround, you could use plperl instead, or a
similarly "easy" PL that provides it.

Here's an example with plperlu, using the core
Digest module. It works with other hash types, not just MD5.
See http://perldoc.perl.org/Digest.html for the exact names
(warning: 'SHA-1' instead of pgcrypto's 'SHA1').

CREATE OR REPLACE FUNCTION lo_digest(text, oid, int)
RETURNS bytea AS
$$
use Digest;
use strict;
use Data::Dumper;
my $ctxt = Digest->new($_[0]);
my $sz=$_[2];
elog(ERROR, "Invalid chunk size: $sz") if ($sz<=0);
my $sth = spi_query("SELECT lo_open($_[1], 262144) as fd");
my $row = spi_fetchrow($sth);
spi_cursor_close($sth);

if ($row) {
my $fd = $row->{fd};
my $bytes;
my $plan = spi_prepare("SELECT loread($fd, $sz) as chunk");
do {
$sth = spi_query_prepared($plan);
$row = spi_fetchrow($sth);
$bytes = decode_bytea($row->{chunk});
$ctxt->add($bytes);
spi_cursor_close($sth);
} while (length($bytes)>0);
spi_exec_query("select lo_close($fd)");
spi_freeplan($plan);
}
return encode_bytea($ctxt->digest);
$$ LANGUAGE plperlu;

Example of use:
# select lo_digest('MD5', 2557608, 2048*256);

lo_digest
------------------------------------
\xa8447e145d0f8d9ca7fe7df1bbf06d75

2557608 is the oid and 2048*256 represents 512KB.
The lo blocksize is typically 2048, so a multiple of 2048 is ideal
to optimize reads.

I expect it to work on very large contents while requesting only modest
amounts of memory. But unfortunately is seems pretty slow.

On my desktop core i5-3470, it takes 2.5 seconds to digest
a 100 MB audio file already in cache. An independant Perl script
doing the equiavlent processing on the same data takes 0.4s .

It's not because of Digest::MD5 or loread(), it seems that it's the
pipe in-between with the text->bytea decoding that eats most of the
CPU cycles, that step being necessary because plperl lacks the ability
to consume bytea directly. Sigh...

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2015-10-06 12:49:09 Re: Recording exceptions within function (autonomous transactions?)
Previous Message Scott Mead 2015-10-06 10:46:16 Re: How to reduce pg_dump backup time