From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: veem v <veema0000(at)gmail(dot)com>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: IO related waits
Date: 2024-09-21 17:20:27
Message-ID: 55697e68-ee97-4571-9b71-c91a9373d40b@aklaver.com
Lists: pgsql-general
On 9/21/24 02:36, veem v wrote:
>
> Actually, here the application uses Kafka and Flink streaming and reuses
> existing code that was doing row-by-row commits, which has now been changed
> to batch commits, i.e. the commit point has shifted from per row to per
> batch. Multiple sessions spawn at the same time to process the incoming
> messages 24/7. Also, as described in another ongoing thread, and as we saw
> in the past, we did not see much difference between "batch commit" and
> "batch insert" performance; we only found that row-by-row commit performs
> worse.
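For concreteness, a minimal sketch of the change being described, assuming
Python with psycopg2 and a hypothetical events(id, payload) table; the
row-by-row INSERTs stay the same, only the commit point moves (the two loops
below are alternatives, shown side by side for comparison):

    # Sketch only: psycopg2, hypothetical events(id, payload) table.
    import psycopg2

    conn = psycopg2.connect("dbname=app")
    cur = conn.cursor()
    rows = [(1, "a"), (2, "b"), (3, "c")]   # messages consumed from the stream

    # Row-by-row commit: one transaction (and one WAL flush) per row.
    for rid, payload in rows:
        cur.execute("INSERT INTO events (id, payload) VALUES (%s, %s)",
                    (rid, payload))
        conn.commit()

    # Batch commit: the same row-by-row INSERTs, one commit per batch.
    # Faster, but the transaction (and its locks) stays open longer.
    for rid, payload in rows:
        cur.execute("INSERT INTO events (id, payload) VALUES (%s, %s)",
                    (rid, payload))
    conn.commit()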
The bottom line is that to solve this a cost is going to have to be paid
somewhere. Previously it was paid with autocommit, in the form of slow
insert performance. You improved the speed of the inserts by wrapping
multiple inserts in transactions, and that led you to this problem, where
open transactions across sessions lead to deadlocks because the same id is
being inserted in concurrent open sessions. Per my and Greg Sabino
Mullane's comments, the solution is going to need planning. Right now you
are playing a round of Whack-a-Mole, making ad hoc changes to portions of
the process without reference to the process as a whole.

At some point the parallel duplicate ids (records) need to be straightened
out into a serial application of data. You and the other people involved
need to come up with a coherent view of the process as a whole, with the
goal of achieving that. Then you can start planning where that cost is best
paid: 1) in the data before the streaming, 2) in the streaming process
itself, 3) in the database, or 4) spread out across 1-3.
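One way to pay part of that cost in the database layer, sketched here only
as an illustration (again assuming Python with psycopg2 and a hypothetical
"events" table with a unique id), is to sort each batch by id and make
duplicates no-ops, so concurrent sessions take row locks in a consistent
order instead of deadlocking on interleaved duplicate ids:

    import psycopg2

    def insert_batch(conn, batch):
        # batch: list of (id, payload) tuples pulled from the stream
        batch = sorted(batch, key=lambda r: r[0])    # consistent lock order
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO events (id, payload) VALUES (%s, %s) "
                "ON CONFLICT (id) DO NOTHING",        # duplicates become no-ops
                batch,
            )
        conn.commit()                                 # one commit per batch

That only papers over the collisions inside the database; the duplicate ids
still have to be reconciled somewhere upstream if the data itself matters.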
>
> Now, in such a scenario, when the messages are coming from a streaming
> framework using Kafka + Flink and the inserts still happen row by row
> (only the commit has shifted to the batch), I don't see any way to sort
> the ID columns in this streaming process so that they won't overlap
> across sessions.
>
> In such a situation, the only way seems to be to replay the messages for
> which the deadlock error happens. As I understand it, during a deadlock
> error one session gets terminated by Postgres; perhaps we can save those
> messages in some exception table and then replay them?
>
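A rough sketch of the replay idea in the quoted question, assuming Python
with psycopg2 (the "events" and "events_exceptions" tables are
hypothetical): when a batch fails with a deadlock (SQLSTATE 40P01), roll
back and park the rows in an exception table so they can be replayed later:

    import psycopg2
    from psycopg2 import errors

    def insert_with_replay(conn, batch):
        try:
            with conn.cursor() as cur:
                cur.executemany(
                    "INSERT INTO events (id, payload) VALUES (%s, %s)", batch)
            conn.commit()
        except errors.DeadlockDetected:
            conn.rollback()   # this session was chosen as the deadlock victim
            with conn.cursor() as cur:
                cur.executemany(
                    "INSERT INTO events_exceptions (id, payload) "
                    "VALUES (%s, %s)", batch)
            conn.commit()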
--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com