Improve postgres 13.2 performance for concurrent bulk insert

From: Ivan Petrov <capacytron(at)gmail(dot)com>
To: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Improve postgres 13.2 performance for concurrent bulk insert
Date: 2021-03-29 23:20:30
Message-ID: CAEARqsGJ3KMCY9MfENnHaxfzux9qZz91gqMG6ehFxANJw42Fbg@mail.gmail.com
Lists: pgsql-admin

Hi,

What I'm trying to do:

- I have a 3rd-party DB (of any type: Postgres, Oracle, ...) that I have no
control over. Let's call it REMOTE_DB
- I have my own Postgres 13.2 instance that I fully control. Let's call it
MY_PG
- REMOTE_DB has a table with 20-200M records
- From time to time I need to fetch 100K-1M records from REMOTE_DB, insert
them into MY_PG, and then do sophisticated joins to produce analytical
results. Let's call this process REMOTE_DB_FETCH
- REMOTE_DB_FETCH saves the fetched records to MY_PG
- MY_PG has a table that mimics the schema of the source table in
REMOTE_DB, including its PK
- There are concurrent processes running REMOTE_DB_FETCH to insert data
into MY_PG

Long story short:

- I cache a subset of REMOTE_DB in MY_PG
- Do analytics in MY_PG

What I've tried so far:

- Concurrent REMOTE_DB_FETCH processes UPSERT with "on conflict do nothing"
- I'm also considering disabling autovacuum, as in
https://gist.github.com/valyala/ae3cbfa4104f1a022a2af9b8656b1131 (see the
sketch after this list)
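
For reference, my understanding of the per-table version of that is
something like this (a sketch only; the table name is from my test further
down):

ALTER TABLE cache_1694 SET (
    autovacuum_enabled = false,
    toast.autovacuum_enabled = false
);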

What else can I try to make the UPSERT from REMOTE_DB into MY_PG faster
under concurrent REMOTE_DB_FETCH processes?
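
One idea I have not benchmarked yet is staging each fetched batch with COPY
and merging it in a single statement. A rough sketch (the staging table is
made up; cache_1694 is the test table shown below):

CREATE UNLOGGED TABLE cache_1694_stage
    (LIKE cache_1694 INCLUDING DEFAULTS);

-- bulk-load the fetched batch; COPY avoids per-row round trips
COPY cache_1694_stage FROM STDIN WITH (FORMAT csv);

-- merge into the real table with the same conflict handling
INSERT INTO cache_1694
SELECT * FROM cache_1694_stage
ON CONFLICT ("varchar_column", "bigint_column") DO NOTHING;

TRUNCATE cache_1694_stage;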

I've created an isolated test to measure performance: inserting around 48K
rows into a table with 150 columns, with only a PK constraint on two
columns: a varchar and a bigint.

The table is unlogged:

isUnlogged: [{relpersistence=u, relname=cache_1694},
{relpersistence=u, relname=cache_1694_pkey}]
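
That output comes from a check along these lines (roughly; relpersistence =
'u' means unlogged):

SELECT relname, relpersistence
FROM pg_class
WHERE relname LIKE 'cache_1694%';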

The SQL:

INSERT INTO "cache_1694" ("varchar_column", "bigint_column",
"another_column" /* more columns, 150 in total */) VALUES (?, ?, ? /* ... */)
ON CONFLICT ("varchar_column", "bigint_column") DO NOTHING;

It took 30 sec to insert 48819 records, which looks slow. What else can I try?
I'm running postgres on my Mac Pro with an SSD disk. Here are the settings:

postgres:
  image: postgres:13.2
  volumes:
    - ~/pgdata:/var/lib/postgresql/data
  ports:
    - "5432:5432"
  environment:
    - POSTGRES_USER=pguser
    - POSTGRES_PASSWORD=pguser
    - POSTGRES_DB=pgdb
    - PGDATA=/var/lib/postgresql/data/pgdata
  command:
    - "postgres"
    - "-c"
    - "max_connections=50"
    - "-c"
    - "shared_buffers=1GB"
    - "-c"
    - "effective_cache_size=1GB"
    - "-c"
    - "synchronous_commit=off"
SELECT * FROM pg_settings WHERE name LIKE '%commit%';

shows

{
  "name": "synchronous_commit",
  "setting": "off"
}

The strange thing is that the UNLOGGED table is not much faster than a
LOGGED one: the same 30 sec for the 48819 inserted (duplicated) rows.
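
To see whether the bottleneck is the server or the client/driver, I plan to
compare against a pure server-side insert (hypothetical; it assumes the
remaining 148 columns are nullable or have defaults):

INSERT INTO "cache_1694" ("varchar_column", "bigint_column")
SELECT md5(g::text), g
FROM generate_series(1, 48819) AS g
ON CONFLICT ("varchar_column", "bigint_column") DO NOTHING;

If that finishes in a fraction of the 30 sec, the time is going into client
round trips rather than the inserts themselves.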
