Re: Upsert performance considerations (~1 mil/hour)

From: Fredrik Blomqvist <fredrik(dot)blomqvist(dot)95(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: Upsert performance considerations (~1 mil/hour)
Date: 2019-09-04 21:18:27
Message-ID: CADnGOyDRGJ47L7S8On9YWqVVdTj3APg1OC-pFpiOxFBBfdHeXA@mail.gmail.com
Lists: pgsql-performance

Thanks for the response Jeff!

On Wed, Sep 4, 2019 at 3:58 PM Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Wed, Sep 4, 2019 at 1:30 PM Fredrik Blomqvist <
> fredrik(dot)blomqvist(dot)95(at)gmail(dot)com> wrote:
>
>> Hi,
>>
>> I have tried doing some research for quite a while on the performance
>> implications of the built-in upsert (INSERT ... ON CONFLICT UPDATE...) when
>> a lot of upserts are made. The scale is something like 1 million
>> records/hour, that is split up in groups of around 300 records each.
>>
>
> How is that done? 300 single-valued insert statements, grouped into one
> transaction? One 300-valued insert statement?
>

It's done using one 300-valued insert.
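A 300-valued insert of that kind looks roughly like this (the table and column names here are made up for illustration; the thread doesn't show the real schema):

```sql
-- Hypothetical table; one statement carries the whole batch of ~300 rows.
INSERT INTO items (id, name, price)
VALUES
    (1, 'widget', 10),
    (2, 'gadget', 20)
    -- ... up to ~300 rows per statement
ON CONFLICT (id) DO UPDATE
SET name  = EXCLUDED.name,
    price = EXCLUDED.price;
```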

>
>
>> So far, one hypothesis is that this project seems to be suffering from
>> the large amount of writes that happen constantly since even if the upsert
>> results in no inserts/updates, the "failed" inserts from the upsert will
>> still get written somewhere (according to our knowledge).
>>
>
> You can suppress redundant updates with a trigger, as described at
> https://www.postgresql.org/docs/current/functions-trigger.html. This
> works even for updates that are the result of INSERT ... ON CONFLICT ... UPDATE.
> There is still some writing, as each tuple does get locked, but it is much
> less (at least from a WAL perspective). You can also put a WHERE clause
> on the DO UPDATE so it only updates if a field has changed, but you have to
> list each field, connected with OR.
>

Didn't know about the trigger method, handy. We were planning on utilizing
the WHERE clause to prevent unnecessary updates, so I suppose that will
make the situation slightly better. However, we are still left with the
unnecessary insert, right? If all 300 values already exist and are up to
date, there will be a failed insert that will have to be vacuumed, right?
Which in turn means we'd probably need to tune autovacuum to a more
aggressive setting if we want to use this kind of upsert.
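For reference, the two suppression techniques Jeff describes can be sketched like this (table and column names are hypothetical; `suppress_redundant_updates_trigger()` is the built-in function from the docs page linked above):

```sql
-- Built-in trigger that skips UPDATEs which would not change the row:
CREATE TRIGGER skip_redundant_updates
BEFORE UPDATE ON items
FOR EACH ROW
EXECUTE FUNCTION suppress_redundant_updates_trigger();

-- Or: a WHERE clause on DO UPDATE, listing each field connected with OR:
INSERT INTO items (id, name, price)
VALUES (1, 'widget', 10)
ON CONFLICT (id) DO UPDATE
SET name = EXCLUDED.name, price = EXCLUDED.price
WHERE items.name  IS DISTINCT FROM EXCLUDED.name
   OR items.price IS DISTINCT FROM EXCLUDED.price;
```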

>> Therefore, the idea is to utilize old-fashioned upserts (writeable CTEs)
>> and do more granular operations that can make sure to only insert data that
>> doesn't already exist, and only update data that has actually changed.
>> Naturally, however, this will put more read-load on the DB and increase
>> query complexity.
>>
>
> It shouldn't put a meaningful additional read load on the database, as the
> ON CONFLICT code still needs to do the read as well. Yes, it makes the
> code slightly more complex.
>

Right, okay. Based on what I have told you so far, would you recommend
going with the old-fashioned upsert or the built-in one? Or is there some
other key information that could swing that decision?
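For comparison, the "old-fashioned" writeable-CTE upsert under discussion might look roughly like this (again with a hypothetical schema; note that, unlike ON CONFLICT, this form is not safe against concurrent inserts of the same keys without extra locking):

```sql
-- Update only rows that actually changed, then insert only missing rows.
WITH input_rows (id, name, price) AS (
    VALUES (1, 'widget', 10), (2, 'gadget', 20)
),
updated AS (
    UPDATE items i
    SET name = v.name, price = v.price
    FROM input_rows v
    WHERE i.id = v.id
      AND (i.name  IS DISTINCT FROM v.name
        OR i.price IS DISTINCT FROM v.price)
    RETURNING i.id
)
INSERT INTO items (id, name, price)
SELECT v.id, v.name, v.price
FROM input_rows v
WHERE NOT EXISTS (SELECT 1 FROM items i WHERE i.id = v.id);
```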

Best,
Fredrik

