Re: Using the database to validate data

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: JPLapham <lapham(at)jandr(dot)org>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Using the database to validate data
Date: 2015-07-24 16:47:16
Message-ID: CAMkU=1zh4nr3U5ggM90kGLmBSeR_vzRJpAtdOyioRws0AEqDRA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Thu, Jul 23, 2015 at 5:55 AM, JPLapham <lapham(at)jandr(dot)org> wrote:

> Hello,
>
> I have an application that occasionally performs large batch inserts of
> user
> hand-generated data. Input is a tab delimited file with typically hundreds
> to a thousand lines of data.
>
> Because the data is generated by hand, there are always many
> transaction-stopping errors in a typical input run. For example, missing
> datum in a NOT NULL column, a duplicate value in a UNIQUE column, data type
> mismatch, FOREIGN KEY reference to something non-existing, etc. Of course,
> we chose PostgreSQL exactly because of these problems, because of the
> robust
> transactional control, rollback on errors, etc.
>
> My question is the following. I would like to *test* the data input for
> integrity in such a way that I can create a report to the user informing
> them of exactly where in their input file to correct the problems.
>
> IDEA 1: My first attempt at this was to simply slurp the data into the
> database, collect the errors, and then rollback. Of course (as I now know),
> this doesn't work because after the first problem, the database reports,
> "current transaction is aborted, commands ignored until end of transaction
> block". This means that I can only report to the user the location of the
> first problem, and then they run the data again, and keep looping through
> the process until the data is good, a huge waste of time.
>

This is how I usually do it, until it become unbearable or an order comes
down from on high to do it differently. If the errors are due to people
being sloppy, then it ought to be annoying for them to be sloppy. Why make
it convenient for them? If the errors are more excusable than just
sloppiness, or if the annoyance is more to you than to the people creating
the errors, then you have to go on to other methods.

>
> IDEA 2: My second idea on how to do this was to ROLLBACK after each INSERT.
> This allows me to check for things like NOT NULL and data type issues, but
> not violations of UNIQUE within the new data.
>

Issue a savepoint before each insert, and then issue a "release savepoint"
if the insert succeeds or a "rollback savepoint" if it does not. If you
release a savepoint, remember that fact so that at the end you rollback the
entire transaction instead of committing it. I rarely actually resort to
this. It might miss some errors in which one failed row failed for
multiple reasons, or where one row would have failed had another row not
already failed for a different reason.

>
> IDEA 3: Write my own data pre-conditioner. Ugh, what a nightmare, I feel
> like I'm writing my own database! Checking for FKEY constraints, UNIQUE,
> etc
> is not trivial. It seems ridiculous to do this when I have the *actual*
> database available to test against!
>

I do this one a lot when I can't get away with method 1, and I don't see
why it is a nightmare. Writing queries against the existing database to
see if the new proposed keys exist *is* trivial. You write them in SQL,
not a low level language.

Testing for internal duplication within the new dataset is a bit harder, I
usually do that in Perl with a hash. There might be cases where Perl and
PostgreSQL disagree about when two values are equal, but I've almost never
run into them in practise.

Check constraints or character encoding issues or typing issues can be
harder to deal with. If those are likely to be a problem, create a temp or
unlogged table with the same check constraints as the real table but
without the unique or foreign key constraints and see if each row inserts.

Cheers,

Jeff

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Chris Withers 2015-07-24 16:59:03 pgbench tps drop from 5000 to 37 going from localhost to a server 13ms away
Previous Message Alejandro Taboada 2015-07-24 14:30:55 BDR Error on Debian wheezy packages