Re: How to perform a long running dry run transaction without blocking

From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Robert Leach <rleach(at)princeton(dot)edu>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to perform a long running dry run transaction without blocking
Date: 2025-02-06 19:13:36
Message-ID: b89a7132-7fa5-4229-a03c-20f5d1e11cf4@aklaver.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 2/6/25 09:58, Robert Leach wrote:

>>> They enter the data in an excel spreadsheet containing about a dozen
>>> inter-related worksheets, named, for example: Study, Animals,
>>> Samples, Tissues, Treatments, Sequences
>>
>> Long term is there a thought to have them enter directly into database
>> where validation can happen in real time

> Having everything work with excel actually made the site more attractive
> to the researchers, because they're comfortable with it and use it
> already, so it lowered the bar for using our software.

[...]

>  It's an interesting challenge, but I think we're
> pretty committed now on this file load path.

Ok, that is generally how my discussions on this subject end. It's my
windmill.

>> The load to the development server does no validation?
>>
>> If so what is the purpose?
>>
>> The background processes are other validation runs?
>
> It's the same code that executes in both cases (with or without the
> `--validate` flag).  All that that flag does is it (effectively) raises
> the dry run exception before it leaves the transaction block, so it
> always validates (whether the flag is supplied or not).

More for my sake then anything else, why do the load to the development
server at all if the production load is the only one that counts?

> So the load doesn't fail until the end of the run, which is inefficient
> from a maintenance perspective.  I've been thinking of adding a
> `--failfast` option for use on the back end.  Haven't done it yet.  I
> started a load yesterday in fact that ran 2 hours before it buffered an
> exception related to a newly introduced bug.  I fixed the bug and ran
> the load again.  It finished sometime between COB yesterday and this
> morning (successfully!).

Alright I am trying to reconcile this with from below, 'The largest
studies take just under a minute'.

>> Seems you are looking for some sort of queuing system.
>>
>> What are the time constraints for getting the validation turned around.
>
> I have considered a queuing system, though when I previously floated a
> proof of concept using celery, I was informed it was too much.  Though,
> at the time, all I was trying to do was a progress bar for a query stats
> feature.  So proposing celery in this instance may get more traction
> with the rest of the team.
>
> Most of the small validation processes finish in under a dozen seconds.
>  The largest studies take just under a minute.  I have plans to
> optimize the loading scripts that hopefully could get the largest
> studies down to a dozen seconds.  If I could do that, and do the back
> end loads in off-peak hours, then I'd be willing to suffer the rare
> timeouts from concurrent validations.  The raw data loads will still
> likely take a much longer time.

This is where I get confused, probably because I am not exactly sure
what constitutes validation. My sense is that involves a load of data
into live tables and seeing what fails PK, FK or other constraints.

If that is the case I am not seeing how the 'for real' data load would
be longer?

At any rate I can't see how loading into a live database multiple sets
of data while operations are going on in the database can be made
conflict free. To me it seems the best that be done is:

1) Reduce chance for conflict by spreading the actions out.

2) Have retry logic that deals with conflicts.

>
>
> Robert William Leach
> Research Software Engineer
> 133 Carl C. Icahn Lab
> Lewis-Sigler Institute for Integrative Genomics
> Princeton University
> Princeton, NJ 08544
>

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Robert Leach 2025-02-06 19:25:57 Re: How to perform a long running dry run transaction without blocking
Previous Message Rob Sargent 2025-02-06 18:53:38 Re: How to perform a long running dry run transaction without blocking