ETL - sql orchestrator is stuck when there is not sleep() between queries

From: Allan Barrielle <allan(dot)barrielle(at)gmail(dot)com>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Cc: Pierrick Martin <pierrick(at)mart1(dot)dev>
Subject: ETL - sql orchestrator is stuck when there is not sleep() between queries
Date: 2021-07-07 23:00:28
Message-ID: CANrBv8rnhJkFySe8qqW=bXiBPKKDjhXpdh=airBPgex3bTp-0A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello guys, I'm facing a problem. Currently I'm working on a Data
transformation Pipeline on Postgres. The strategy is,

We select every tables in a given schema ( 50 tables ), we apply some case
when, translation, enum and load it into a different new schema with a
CREATE TABLE SCHEMA_2.table_1 AS SELECT * FROM SCHEMA_1.TABLE_1, then we do
it again about 3 more times and everytime it’s a new schema, new table. We
only keep and don’t drop the schema1.

To orchestrate the whole, we've got a bunch of .sql files that we run by
using psql directly. That's our "strategy".

So we're copying a lot of data, but it allows us to debug, and investigate
business bugs, because we can plug us into schema 2,3 and search why it's
an issue.

All is fine, and can work great.
But sometimes, some queries that used to take about 20 secs to complete can
suddenly end in 5mins.
Important all queries have the same shape -> CREATE TABLE SELECT AS *(a bit
of transform) FROM TABLE). No update, nothing, it’s dead simple.

We are just trying to copy a table from schema1, to schema2, to schema3 and
finally schema3. That’s it.
The thing to understand here is schema2, schema3 are dropped at every
pipeline transformation, so everytime we run the script, it drops
everything from schema2 to the final stage.

We tuned the config a little bit, and we tried kind of everything (
synchronous_commit, wal, vacuum )
Nothing works, it’s very random, some query won’t simply work ( even after
hours ).

We use different machines, different config, and different datasets.

The only thing that makes it work every time, in 100% cases, is to put a
sleep(10sec) between each schema.
So we select 50 tables, we create a new schema with it, then we sleep 10
sec then we do again the same query but with the freshly created schema and
we create a third schema, sleep 10s and again..

And that makes the whole pipeline successful each time.

So, It seems it's a background process inside postgres, that should ingest
a lot of data, and we have to give him time to take a rest, like a
bg_writers or something else ?
I disabled autovacuum=off . Same.
Why does the query never end even after hours ? Why there is no log about
where the query is stuck.
To be clear, if I kill the stuck query and run again it will work.

I don't know much about what's going on inside Postgres, which randomly
takes a lot of time, with the same code, same data.

PostgreSQL 12.4 (Debian 12.4-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled
by gcc (Debian 8.3.0-6) 8.3.0, 64-bit

Thank you so much for your time..

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Luca Ferrari 2021-07-08 06:50:30 Re: hint in determining effective_io_concurrency
Previous Message Merlin Moncure 2021-07-07 21:42:40 Re: hint in determining effective_io_concurrency