Re: Table checksum proposal

From: hubert depesz lubaczewski <depesz(at)gmail(dot)com>
To: matt(at)byrney(dot)com
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Table checksum proposal
Date: 2014-07-24 11:43:29
Message-ID: CAKrjmhcorNLdR-6fxYakfdeD2n0=sJqdsivn9g7XwVx5o6mzgQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Thu, Jul 24, 2014 at 3:35 AM, <matt(at)byrney(dot)com> wrote:

> I have a suggestion for a table checksumming facility within PostgreSQL.
> The applications are reasonably obvious - detecting changes to tables,
> validating data migrations, unit testing etc. A possible algorithm is as
> follows:
>
> 1. For each row of the table, take the binary representations of the
> values and serialise them to CSV.
> 2. Calculate the MD5 sum of each CSV-serialised row.
> 3. XOR the row MD5 sums together.
> 4. CSV-serialise and MD5 a list of representations (of some sort) of the
> types of the table's columns and XOR it with the rest.
> 5. Output the result as the table's checksum.
>
> Advantages of this approach:
>
> 1. Easily implemented using SPI.
> 2. Since XOR is commutative and associative, order of ingestion of rows
> doesn't matter; therefore, unlike some other table checksumming methods,
> this doesn't need an expensive ORDER BY *. So, this should be pretty much
> as fast as a SELECT * FROM, which is probably as fast as a table checksum
> can be.
> 3. Using a cursor in SPI, rows can be ingested a few at a time. So memory
> footprint is low even for large tables.
> 4. Output has a convenient fixed size of 128 bits.
>
> Questions:
>
> 1. Should this be a contrib module which provides a function, or should it
> be a built-in piece of functionality?
> 2. Is MD5 too heavyweight for this? Would using a non-cryptographic
> checksum be worth the speed boost?
> 3. Is there a risk of different architectures/versions returning different
> checksums for tables which could be considered identical? If so, is this
> worth worrying about?
>

Hmm - Do you really think we need an extension for something that can be
done using query as simple as:

select md5(string_agg(md5(c::text), '' order by md5(c::text))) from
pg_class c;

(of course you can do it on any table, not only pg_class).

If you want to use the xor idea (which make sense), all you need is to
write xor aggregate.

depesz

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Ramesh T 2014-07-24 12:54:56 Re: tab_to_sting
Previous Message hubert depesz lubaczewski 2014-07-24 11:34:00 Re: event triggers in 9.3.4