Upsert performance considerations (~1 mil/hour)

From: Fredrik Blomqvist <fredrik(dot)blomqvist(dot)95(at)gmail(dot)com>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Upsert performance considerations (~1 mil/hour)
Date: 2019-09-04 17:29:45
Message-ID: CADnGOyCbRZuhFrZNFTaMr=i-K8ENLMU+fyJyZhk5s7FB2isC=Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi,

I have tried doing some research for quite a while on the performance
implications of the built-in upsert (INSERT ... ON CONFLICT UPDATE...) when
a lot of upserts are made. The scale is something like 1 million
records/hour, that is split up in groups of around 300 records each.

The table is fairly simple, just a few int/bool fields and a date field.
The total size of the table is somewhere between 50-60 million records.
Essentially, all the rows are supposed to stay up-to-date within a certain
cycle (that is currently about 2 days). Sometimes a lot of information
changes, sometimes very little/no information changes (e.g. 300 records get
upserted but they are identical to the existing records).

So far, one hypothesis is that this project seems to be suffering from the
large amount of writes that happen constantly since even if the upsert
results in no inserts/updates, the "failed" inserts from the upsert will
still get written somewhere (according to our knowledge). Therefore, the
idea is to utilize old-fashioned upserts (writeable CTEs) and do more
granular operations that can make sure to only insert data that doesn't
already exist, and only update data that has actually changed. Naturally,
however, this will put more read-load on the DB and increase query
complexity.

The question is, are we right in our assumptions that the built-in upsert
is useless from a performance perspective (e.g. it's only good for much
smaller tasks) or are we wrong? I read a bit about HOT updates and
autovacuum tuning, but nothing that references something more similar to
this question.

Worth mentioning is that this DB (PostgreSQL 10.9) is running on Heroku so
we are not able to tune it to our needs. We are planning to move at some
point, depending on how important it ends up being. Finally, it is also
worth mentioning that this DB also has one follower (i.e. read replica).

Would really appreciate some good insight on this! Thanks beforehand.

Best,
Fredrik

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Jeff Janes 2019-09-04 19:58:05 Re: Upsert performance considerations (~1 mil/hour)
Previous Message Oleg Kharin 2019-09-03 17:47:42 Incorrect choice of Nested Loop for a skewed distribution