Where the data stitching/update/deduplication should happen

From: yudhi s <learnerdatabase99(at)gmail(dot)com>
To: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Where the data stitching/update/deduplication should happen
Date: 2024-02-29 06:34:36
Message-ID: CAEzWdqeEq13ywQxmAaN1iD2JJDuSU9HgLx7ov1NhFqeg9vRKQA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello All,
It's related to data flow to OLTP database which is mostly going to be
postgres.

We are designing a system which is going to move data from input files(in
Avro format) to Goldengate to kafka topics to the database. Incoming
files-->GGS--> KAFKA-->OLTP Database. This would be a heavy transactional
system processing ~10K txn/second. The database is supposed to show the
near real time transactions to the users. The transactions which come from
kafka topics will be asynchronous in nature and also there are chances of
duplicate data being ingested from kafka topics. So the data has to be
stitched/ updated/deduplicated before showing it to the users a complete
transaction or say before persisting it to the normalized data model which
would be ready for the querying by the end users.

So where should we perform these stitching/update/deduplication stuff in
this workflow? Should it happen inside the application somewhere in the
kafka consumer(using poison pill concept) or should it happen in a stage
schema in the database by persisting all the pieces of transaction as it is
coming from kafka topics. Adding another stage layer within the database is
going to add some more time to the data to be visible to the users and thus
it may not be near real time. As it will take some more time to move the
data to the main transaction/normalized tables from the stage tables.

Or should we persist the data as is in the stage area and show the data
from stage itself if some users are okay with partial transaction data and
showing the complete transaction data from the normalized table to other
users who want to see it as a complete transaction but with some delay?

What is the appropriate design to address such use cases?

Regards
Yudhi

Browse pgsql-general by date

  From Date Subject
Next Message Dominique Devienne 2024-02-29 07:55:19 Re: Non-Stored Generated Columns
Previous Message Mark Schloss 2024-02-29 04:39:49 RE: walreceiver fails on asynchronous replica [EXTERNAL] [SEC=UNOFFICIAL]