From: | hubert depesz lubaczewski <depesz(at)gmail(dot)com> |
---|---|
To: | matt(at)byrney(dot)com |
Cc: | pgsql-general <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Table checksum proposal |
Date: | 2014-07-24 11:43:29 |
Message-ID: | CAKrjmhcorNLdR-6fxYakfdeD2n0=sJqdsivn9g7XwVx5o6mzgQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On Thu, Jul 24, 2014 at 3:35 AM, <matt(at)byrney(dot)com> wrote:
> I have a suggestion for a table checksumming facility within PostgreSQL.
> The applications are reasonably obvious - detecting changes to tables,
> validating data migrations, unit testing etc. A possible algorithm is as
> follows:
>
> 1. For each row of the table, take the binary representations of the
> values and serialise them to CSV.
> 2. Calculate the MD5 sum of each CSV-serialised row.
> 3. XOR the row MD5 sums together.
> 4. CSV-serialise and MD5 a list of representations (of some sort) of the
> types of the table's columns and XOR it with the rest.
> 5. Output the result as the table's checksum.
>
> Advantages of this approach:
>
> 1. Easily implemented using SPI.
> 2. Since XOR is commutative and associative, order of ingestion of rows
> doesn't matter; therefore, unlike some other table checksumming methods,
> this doesn't need an expensive ORDER BY *. So, this should be pretty much
> as fast as a SELECT * FROM, which is probably as fast as a table checksum
> can be.
> 3. Using a cursor in SPI, rows can be ingested a few at a time. So memory
> footprint is low even for large tables.
> 4. Output has a convenient fixed size of 128 bits.
>
> Questions:
>
> 1. Should this be a contrib module which provides a function, or should it
> be a built-in piece of functionality?
> 2. Is MD5 too heavyweight for this? Would using a non-cryptographic
> checksum be worth the speed boost?
> 3. Is there a risk of different architectures/versions returning different
> checksums for tables which could be considered identical? If so, is this
> worth worrying about?
>
Hmm - Do you really think we need an extension for something that can be
done using query as simple as:
select md5(string_agg(md5(c::text), '' order by md5(c::text))) from
pg_class c;
(of course you can do it on any table, not only pg_class).
If you want to use the xor idea (which make sense), all you need is to
write xor aggregate.
depesz
From | Date | Subject | |
---|---|---|---|
Next Message | Ramesh T | 2014-07-24 12:54:56 | Re: tab_to_sting |
Previous Message | hubert depesz lubaczewski | 2014-07-24 11:34:00 | Re: event triggers in 9.3.4 |