Re: How to perform a long running dry run transaction without blocking

From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Robert Leach <rleach(at)princeton(dot)edu>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to perform a long running dry run transaction without blocking
Date: 2025-02-06 17:15:55
Message-ID: 6d833658-f461-4ad4-a3e1-86d3c515bc18@aklaver.com
Lists: pgsql-general

On 2/6/25 08:53, Robert Leach wrote:
> Great questions! Responses inline...

Please reply to the list as well.
CCing the list.

>> Load to where, an existing table or a temporary table?
>
> Existing tables. Note that (and this is a point of contention in our team), we have some fields that must be universally unique across all study data, so if we were to employ temporary (empty) tables, we would miss out on the uniqueness checks.

That is another discussion.

>
> Background: The previous solution I had tried for this validation interface used a partial copy of the database, which I called the "validation" database. (I'd been forced to do it that way because previous developers hadn't used transactions, and their "dry run" mode had side effects that I didn't want triggered by the validation interface, so instead of engaging in a massive refactor, I tried using a separate database as a quick temporary fix.) But mind you, all of the code I was writing was in Django (Python), and in order to maintain two different databases and stay database-architecture agnostic, I learned that your code must be littered with hundreds of insertions of `.using()` (and other) statements, and it turned out that strategy is not 100% compatible with every kind of Django ORM operation (e.g. `full_clean()`), so I had to dig around in Django core code to ensure every operation was being performed on the correct database. It was a nightmare to maintain, and I happily ripped it all out when I corrected the original problems by wrapping everything in a transaction.

Yeah, that's why I bypass the ORM.
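
For the archives, the kind of littering you describe looks roughly like
this. An untested sketch; the app, the model names, and the "validation"
alias are all hypothetical:

from django.db import transaction
from myapp.models import Animal, Sample  # hypothetical app and models

def validate_sample(payload, dry_run=True):
    # Route every operation to the copy used for dry runs.
    db = "validation" if dry_run else "default"
    with transaction.atomic(using=db):
        animal = Animal.objects.using(db).get(name=payload["animal"])
        sample = Sample(animal=animal, name=payload["sample"])
        # full_clean() takes no using= argument; for an unsaved object
        # its unique checks go to whichever database the router picks,
        # which is one of the ways this approach breaks down.
        sample.full_clean()
        sample.save(using=db)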

>
>>> This validation interface skips the raw data load step, which is the heftiest, longest-running part, and it generally finishes in well under a minute.
>>
>> So what is the validation actually doing?
>
> The raw data is pretty solid. It is automatically generated by hardware (mass spectrometer) and software (peak analysis suites). So it doesn't need much (if any) validation. What needs validation is all the metadata associated with the samples that is totally generated by hand by the researchers.
>
> They enter the data in an excel spreadsheet containing about a dozen inter-related worksheets, named, for example: Study, Animals, Samples, Tissues, Treatments, Sequences

Well, there's your problem :) I will spare you my full spreadsheet rant.

Long term, is there any thought of having them enter the data directly
into the database, where validation can happen in real time?
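
If it helps make that case: most of the checks you describe can live in
the schema itself, so they fire on every insert no matter which code
path does the writing. A minimal sketch with hypothetical Django models:

from django.db import models

class Animal(models.Model):
    # "universally unique across all study data"
    name = models.CharField(max_length=100, unique=True)

class Sample(models.Model):
    animal = models.ForeignKey(Animal, on_delete=models.PROTECT)
    name = models.CharField(max_length=100)

    class Meta:
        constraints = [
            models.UniqueConstraint(fields=["name"], name="uniq_sample_name"),
        ]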

>> Where are the background processes loading data to?
>
> We first run our loads on a development server with a separate copy of the database, but then we re-run those same loads on the production server, where users perform their validations.

The load to the development server does no validation?

If so, what is the purpose?

The background processes are other validation runs?

>
> One of the thoughts I'd had to work around the problem was to somehow dump the data from the development load and load it onto production during some sort of scheduled downtime. However, even if we do that, I'm concerned that multiple users concurrently validating different submissions would encounter this blocking issue, and since those validations can take (on the upper end) just under a minute, that's enough for at least one user to encounter a timeout. I have not yet proven that this can happen, but based on my reading of the Postgres documentation, it seems logical.
>

Seems you are looking for some sort of queuing system.

What are the time constraints for getting the validation turned around?
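
Whatever you settle on, you can also keep a blocked validation from
hanging past your web server's patience by capping the wait inside the
transaction. A rough Django-flavored sketch; the timeout values are
guesses:

from django.db import connection, transaction

def run_validation(validate):
    with transaction.atomic():
        with connection.cursor() as cur:
            # SET LOCAL lasts only until the end of this transaction.
            cur.execute("SET LOCAL lock_timeout = '5s'")
            cur.execute("SET LOCAL statement_timeout = '45s'")
        # A blocked wait on another transaction's uncommitted rows now
        # raises after 5s instead of hanging until the request times out.
        validate()

A lock timeout at least turns a mystery hang into a distinct error you
can catch and report as "try again shortly".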

>> Not that I know of, and that would be a false promise anyway, as I know of no process that is perfect.
>
> I realize that it's a false promise WRT the background load transaction, but it's a moot concern from the perspective of the validation page, because that page makes a guaranteed promise that it will never commit. All I want is for it not to be blocked, so that I can report as many errors as possible to the researcher and they can advance their submission compilation. Besides, it's a pretty solid bet that since the load succeeded on the development server, it will succeed on the production server, where this block would happen.
>
> So if the load transaction does fail, and the data the validation process was waiting on (to see the outcome) is not reported as problematic to the user, it will be reported as problematic to those responsible for the load on the back-end, so it gets dealt with either way.
>

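For what it's worth, the never-commit promise can be made explicit
instead of relying on an exception escaping. A minimal sketch, assuming
Django; validate_submission is a hypothetical stand-in for your loading
code:

from django.db import transaction

def dry_run(validate_submission, data):
    with transaction.atomic():
        # Gather as many errors as possible before bailing out.
        errors = validate_submission(data)
        # Force a rollback even when validation succeeds: the promise
        # that this transaction never commits, made explicit.
        transaction.set_rollback(True)
    return errors
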
>>
>> --
>> Adrian Klaver
>> adrian(dot)klaver(at)aklaver(dot)com
>
>
>
> Robert William Leach
> Research Software Engineer
> 133 Carl C. Icahn Lab
> Lewis-Sigler Institute for Integrative Genomics
> Princeton University
> Princeton, NJ 08544
>

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com
