Re: table versioning approach (not auditing)

From: Adam Brusselback <adambrusselback(at)gmail(dot)com>
To: Felix Kunde <felix-kunde(at)gmx(dot)de>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: table versioning approach (not auditing)
Date: 2014-10-02 14:38:20
Message-ID: CAMjNa7cjVmEKHe104OsoZA0zC13LchBinRw_E6c7vhQCnpnPHw@mail.gmail.com
Lists: pgsql-general

Ended up running for 28 min, but it did work as expected.

On Thu, Oct 2, 2014 at 10:27 AM, Adam Brusselback <adambrusselback(at)gmail(dot)com> wrote:

> Testing that now. Initial results are not looking too performant.
> I have one single table which had 234575 updates done to it. I am rolling
> back 13093 of them. It's been running 20 min now, using 100% of a single
> core, and almost 0 disk. No idea how long it'll run at this point.
>
> This is on an i5 desktop with 16 gigs of ram and an ssd.
>
> This is a pretty good test though, as it's a real world use case (even if
> the data was generated with PGBench). We now know that area needs some
> work before it can be used for anything more than a toy database.
>
> Thanks,
> -Adam
>
> On Thu, Oct 2, 2014 at 7:52 AM, Felix Kunde <felix-kunde(at)gmx(dot)de> wrote:
>
>> Hey there
>>
>> Thanks again for the fix. I was able to merge it into my repo.
>> Also thanks for benchmarking audit. Very interesting results.
>> I wonder how the recreation of former database states scales when
>> processing many deltas.
>> Haven’t done a lot of testing in that direction.
>>
>> I will transfer the code soon to a more public repo on GitHub. As far as
>> I see I have to create an organization for that.
>>
>> Cheers
>> Felix
>>
>> *Sent:* Wednesday, 1 October 2014 at 17:09
>>
>> *From:* "Adam Brusselback" <adambrusselback(at)gmail(dot)com>
>> *To:* "Felix Kunde" <felix-kunde(at)gmx(dot)de>
>> *Cc:* "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
>> *Subject:* Re: [GENERAL] table versioning approach (not auditing)
>> I know we're kinda hijacking this thread, so sorry for that. If you'd
>> like to do that, I'd be more than happy to use it and push any fixes /
>> changes upstream. I don't have much of a preference on the name either, as
>> long as it's something that makes sense.
>>
>> I would consider myself far from an expert though! Either way, more
>> people using a single solution is a good thing.
>>
>> As a side note, I did some benchmarking this morning and wanted to share
>> the results:
>> pgbench -i -s 140 -U postgres pgbench
>>
>> pgbench -c 4 -j 4 -T 600 -U postgres pgbench
>> no auditing tps: 2854
>> NOTE: Accounts are audited
>> auditing tps: 1278
>>
>> pgbench -c 2 -j 2 -N -T 300 -U postgres pgbench
>> no auditing tps: 2504
>> NOTE: Accounts are audited
>> auditing tps: 822
>>
>> pgbench -c 2 -j 2 -T 300 -U postgres pgbench
>> no auditing tps: 1836
>> NOTE: branches and tellers are audited, accounts are not
>> auditing tps: 505
>>
>> I'd love to see if there are some easy wins to boost the performance.
>>
>> On Wed, Oct 1, 2014 at 5:19 AM, Felix Kunde <felix-kunde(at)gmx(dot)de> wrote:
>>>
>>> Hey there. Thank you very much for that fix! That's why I'd like to have
>>> a joint development and joint testing. It's way more convincing for users
>>> to go for a solution that is tested by some experts than just by a random
>>> developer :)
>>>
>>> I'm open to create a new project and push the code there. Don't care
>>> about the name. Then we might figure out which parts are already good,
>>> which parts could be improved and where to go next. I think switching to
>>> JSONB for example will be easy, as it offers the same functions as JSON
>>> afaik.
>>>
>>>
>>> Sent: Tuesday, 30 September 2014 at 21:16
>>> From: "Adam Brusselback" <adambrusselback(at)gmail(dot)com>
>>> To: "Felix Kunde" <felix-kunde(at)gmx(dot)de>
>>> Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
>>> Subject: Re: [GENERAL] table versioning approach (not auditing)
>>>
>>> Felix, I'd love to see a single, well-maintained project. For example, I
>>> just found yours, and gave it a shot today after seeing this post. I found
>>> a bug: when an update command is issued but the old and new values are all
>>> the same, the trigger blows up. I've got a fix for that, but if we
>>> had one project that more than a handful of people used, stuff like that
>>> would be quashed very quickly.
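>>>
>>> For anyone hitting the same thing before the fix lands: a no-op update
>>> can be skipped with a guard at the top of the trigger function, roughly
>>> like this (a generic sketch, not necessarily the exact fix I sent):
>>>
>>> CREATE OR REPLACE FUNCTION log_changes() RETURNS trigger AS $$
>>> BEGIN
>>>   -- Nothing changed, so there is no delta worth logging.
>>>   IF TG_OP = 'UPDATE' AND ROW(NEW.*) IS NOT DISTINCT FROM ROW(OLD.*) THEN
>>>     RETURN NEW;
>>>   END IF;
>>>   -- ... build and insert the JSON delta here ...
>>>   RETURN NEW;
>>> END;
>>> $$ LANGUAGE plpgsql;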
>>>
>>> I love the design of it by the way. Any idea what it will take to move
>>> to JSONB for 9.4?
>>>
>>>
>>> On Tue, Sep 30, 2014 at 7:22 AM, Felix Kunde <felix-kunde(at)gmx(dot)de> wrote:
>>>
>>> Hey
>>>
>>> Yes, I'm adding an additional key to each of my tables. First I wanted to
>>> use the primary key as one column in my audit_log table, but in some of my
>>> tables the PK consists of more than one column. Plus, it's nice to have one
>>> key that is named the same across all tables.
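>>>
>>> Simplified, that setup looks about like this (the names are not exactly
>>> the ones in my repo):
>>>
>>> ALTER TABLE my_table ADD COLUMN audit_id BIGSERIAL;  -- same key name everywhere
>>>
>>> CREATE TABLE audit_log (
>>>   audit_id   BIGINT NOT NULL,              -- points back to the versioned row
>>>   table_name TEXT NOT NULL,
>>>   changed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
>>>   changes    JSON                          -- only the OLD values that changed
>>> );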
>>>
>>> To get a former state for one row at date x, I need to join the latest
>>> delta BEFORE date x with each delta AFTER date x. If I logged complete
>>> rows, this joining part would not be necessary, but as I usually work with
>>> spatial databases that have complex geometries and also image files, that
>>> strategy consumes too much disk space.
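>>>
>>> As a rough sketch of what that lookup means for a single column
>>> (simplified, with made-up names and a :x placeholder for the date):
>>>
>>> -- The value at date :x is the oldest OLD value logged after :x,
>>> -- or the live value if that column never changed again.
>>> SELECT COALESCE(
>>>          (SELECT a.changes ->> 'price'
>>>           FROM audit_log a
>>>           WHERE a.audit_id = t.audit_id
>>>             AND a.changed_at > :x
>>>             AND (a.changes ->> 'price') IS NOT NULL
>>>           ORDER BY a.changed_at
>>>           LIMIT 1),
>>>          t.price::text) AS price_at_x
>>> FROM my_table t
>>> WHERE t.audit_id = 42;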
>>>
>>> If there are more users following a similar approach, I wonder why we
>>> don't throw all the good ideas together to have one solution that is
>>> tested, maintained and improved by more developers. This would be great.
>>>
>>> Felix
>>>
>>>
>>> Sent: Monday, 29 September 2014 at 23:25
>>> From: "Abelard Hoffman" <abelardhoffman(at)gmail(dot)com>
>>> To: "Felix Kunde" <felix-kunde(at)gmx(dot)de>
>>> Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
>>> Subject: Re: [GENERAL] table versioning approach (not auditing)
>>>
>>> Thank you Felix, Gavin, and Jonathan for your responses.
>>>
>>> Felix & Jonathan: both of you mention just storing deltas. But if you do
>>> that, how do you associate the delta record with the original row? Where's
>>> the PK stored, if it wasn't part of the delta?
>>>
>>> Felix, thank you very much for the example code. I took a look at your
>>> table schemas. I need to study it more, but it looks like the way you're
>>> handling the PK is by adding a separate synthetic key (audit_id) to
>>> each table that's being versioned, and then storing that key along with the
>>> delta.
>>>
>>> So then to find all the versions of a given row, you just need to join
>>> the audit row with the schema_name.table_name.audit_id column. Is that
>>> right? The only potential drawback there is that there's no referential
>>> integrity between audit_log.audit_id and the actual table.
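>>>
>>> In other words, something like this would list a row's history (if I'm
>>> reading the schemas right; the names here are guesses):
>>>
>>> SELECT a.changed_at, a.changes
>>> FROM audit_log a
>>> JOIN my_schema.my_table t ON t.audit_id = a.audit_id
>>> WHERE t.id = 849
>>> ORDER BY a.changed_at;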
>>>
>>> I do like that approach very much though, in that it eliminates the need
>>> to interrogate the json data in order to perform most queries.
>>>
>>> AH
>>>
>>>
>>>
>>> On Mon, Sep 29, 2014 at 12:26 AM, Felix Kunde <felix-kunde(at)gmx(dot)de> wrote:
>>>
>>> Hey
>>>
>>> I've also tried to implement database versioning using JSON to log
>>> changes in tables. Here it is:
>>> https://github.com/fxku/audit
>>> I've got two versioning tables, one storing information about all
>>> transactions that happened and one where I put the JSON logs of row changes
>>> of each table. I'm only logging old values and not complete rows.
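>>>
>>> Simplified, the two tables look about like this (not the exact DDL from
>>> the repo; table_log is a made-up name for the second one):
>>>
>>> CREATE TABLE transaction_log (
>>>   txid      BIGINT PRIMARY KEY,       -- one entry per transaction
>>>   stmt_date TIMESTAMPTZ NOT NULL,
>>>   user_name TEXT
>>> );
>>>
>>> CREATE TABLE table_log (
>>>   id         BIGSERIAL PRIMARY KEY,
>>>   txid       BIGINT REFERENCES transaction_log (txid),
>>>   table_name TEXT NOT NULL,
>>>   changes    JSON                     -- old values only, not complete rows
>>> );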
>>>
>>> Then I have a function that recreates a database state at a given time
>>> into a separate schema - either to VIEWs, MVIEWs or TABLES. This database
>>> state could then be indexed in order to work with it. You can also reset
>>> the production state to the recreated past state.
>>>
>>> Unfortunately I've got no time to work on it further at the moment, and I
>>> have not done tests with many changes in the database, so I can't say if the
>>> recreation process scales well. One downside I've realised is that the
>>> json_agg function has limits when I've got binary data. It gets too long.
>>> So I'm really looking forward to using JSONB.
>>>
>>> I have more plans in mind. By having a Transaction_Log table it
>>> should be possible to revert only certain transactions. I'm also thinking
>>> of parallel versioning, e.g. different users are all working with their
>>> version of the database and commit their changes to the production state.
>>> As I've got a unique history ID for each table and each row, I should be
>>> able to map the affected records.
>>>
>>> Have a look and tell me what you think of it.
>>>
>>> Cheers
>>> Felix
>>>
>>>
>>> Sent: Monday, 29 September 2014 at 04:00
>>> From: "Abelard Hoffman" <abelardhoffman(at)gmail(dot)com>
>>> To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
>>> Subject: [GENERAL] table versioning approach (not auditing)
>>>
>>> Hi. I need to maintain a record of all changes to certain tables to
>>> assist in viewing history and reverting changes when necessary (customer
>>> service makes an incorrect edit, etc.).
>>>
>>> I have studied these two audit trigger examples:
>>>
>>> https://wiki.postgresql.org/wiki/Audit_trigger
>>> https://wiki.postgresql.org/wiki/Audit_trigger_91plus
>>>
>>> I've also read about two other approaches to versioning:
>>> 1. maintain all versions in one table, with a flag to indicate which is
>>> the current version
>>> 2. have a separate versions table for each real table, and insert into
>>> the associated version table whenever an update or insert is done (a
>>> minimal sketch of this approach follows below).
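>>>
>>> To make approach 2 concrete, the minimal shape would be something like
>>> this (illustrative names; "users" stands in for any real table):
>>>
>>> CREATE TABLE users_versions (LIKE users);
>>> ALTER TABLE users_versions ADD COLUMN versioned_at TIMESTAMPTZ DEFAULT now();
>>>
>>> CREATE FUNCTION users_version() RETURNS trigger AS $$
>>> BEGIN
>>>   INSERT INTO users_versions SELECT NEW.*;  -- versioned_at gets its default
>>>   RETURN NEW;
>>> END;
>>> $$ LANGUAGE plpgsql;
>>>
>>> CREATE TRIGGER users_version_trg
>>> AFTER INSERT OR UPDATE ON users
>>> FOR EACH ROW EXECUTE PROCEDURE users_version();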
>>>
>>> My current implementation is based on the wiki trigger examples, using a
>>> single table, and a json column to record the row changes (rather than
>>> hstore). What I like about that, in particular, is I can have a "global,"
>>> chronological view of all versioned changes very easily.
>>>
>>> But there are two types of queries I need to run.
>>> 1. Find all changes made by a specific user
>>> 2. Find all changes related to a specific record
>>>
>>> #1 is simple to do. The versioning table has a user_id column recording
>>> who made the change, so I can query on that.
>>>
>>> #2 is more difficult. I may want to fetch all changes to a group of
>>> tables that are all related by foreign keys (e.g., find all changes to
>>> "user" record 849, along with any changes to their "articles," "photos,"
>>> etc.). All of the data is in the json column, of course, but it seems like
>>> a pain to try and build a query on the json column that can fetch all those
>>> relationships (and if I mess it up, I probably won't generate any errors,
>>> since the json is so free-form).
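>>>
>>> For instance, pulling one user's article changes out of the json looks
>>> something like this (sketching with guessed column names), and a mistyped
>>> key just filters everything out instead of erroring:
>>>
>>> SELECT *
>>> FROM versions v
>>> WHERE v.table_name = 'articles'
>>>   AND (v.row_data ->> 'user_id')::int = 849;  -- wrong key => NULL, no error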
>>>
>>> So my question is, do you think using the json approach is wrong for
>>> this case? Does it seem better to have separate versioning tables
>>> associated with each real table? Or another approach?
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-general
>>
>>
>
