Re: Question on overall design

From: Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
To: veem v <veema0000(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Question on overall design
Date: 2023-12-10 16:55:29
Message-ID: CANzqJaBjuapU13MSxefShCaY4vRws_r2x0sMs33HaQbXt-snxw@mail.gmail.com
Lists: pgsql-general

* PG has pgbench; *maybe* you can hack it to work on Oracle.
* If you want to know how well an RDBMS will work on your workload, then
you must provide it with a simulated workload. Right?
* AWS RDS Postgresql has a dashboard that *might* be similar to AWR. Or it
might not...
* We departitioned because SELECT statements were *slow*. All partitions
were scanned, even when the partition key was specified in the WHERE clause.
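
FWIW, the quickest way to check whether pruning actually happens for a
given query is EXPLAIN against a small range-partitioned table. A
minimal sketch (table and column names are made up; plain PostgreSQL
declarative partitioning):

  -- hypothetical daily range-partitioned transaction table
  CREATE TABLE txn (
      txn_id  bigint        NOT NULL,
      txn_ts  timestamptz   NOT NULL,
      amount  numeric(12,2)
  ) PARTITION BY RANGE (txn_ts);

  CREATE TABLE txn_2023_12_10 PARTITION OF txn
      FOR VALUES FROM ('2023-12-10') TO ('2023-12-11');
  CREATE TABLE txn_2023_12_11 PARTITION OF txn
      FOR VALUES FROM ('2023-12-11') TO ('2023-12-12');

  -- If pruning works, only txn_2023_12_11 shows up in the plan; if every
  -- partition is scanned, look for casts/functions applied to the
  -- partition key, or for generic plans defeating run-time pruning.
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM txn
  WHERE txn_ts >= '2023-12-11' AND txn_ts < '2023-12-12';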

On Sun, Dec 10, 2023 at 8:45 AM veem v <veema0000(at)gmail(dot)com> wrote:

> Thank you so much Ron. I have some more doubts related to this.
>
> We were wondering if there is any utility in PG with which we can
> generate a large sample data volume, which we could run both on our
> on-premise Oracle Exadata box and on Aurora PostgreSQL in the cloud to
> compare read and write performance. Does any such utility exist for
> quickly producing a large data volume? But I think what you are saying
> is that we should use real data to get the most accurate benchmark
> possible; correct me if I'm wrong?
>
> In Oracle we used the data dictionary views (the AWR views) to see
> current and historical performance statistics such as CPU, I/O, memory
> usage, object-level contention, etc. Is anything like that available in
> Aurora PostgreSQL, so we can monitor performance, get an idea of how
> well the load test is going, and see what capacity is left or whether
> we are saturating it?
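
There is no exact AWR equivalent, but pg_stat_statements (plus
Performance Insights and CloudWatch on the RDS/Aurora side) covers much
of the same ground. A minimal sketch, assuming the extension has been
enabled through the instance's parameter group:

  -- create the extension in the database being monitored
  CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

  -- top statements by cumulative execution time
  SELECT query,
         calls,
         rows,
         total_exec_time,   -- "total_time" on PostgreSQL 12 and older
         shared_blks_hit,
         shared_blks_read
  FROM pg_stat_statements
  ORDER BY total_exec_time DESC
  LIMIT 20;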
>
> When you said "*Beware of excessive partitioning. We had to
> "departition" most tables, because of the impact on performance.*", my
> understanding is that partitioning helps reduce I/O while reading,
> because partition pruning means less data is scanned, and that writes
> carry only minimal overhead for identifying the partition the
> INSERTed/UPDATEd/DELETEd rows land in. So can you please help me
> understand what exact performance impact you are highlighting here? Are
> you talking about the size of the partitions or the total count of
> partitions? In our case we are planning either daily or hourly range
> partitions, depending on data volume; I'm not sure whether there is a
> sweet spot for partition size in PostgreSQL. If you are pointing to a
> high partition count, then in our case, persisting ~90 days of data
> would mean ~90 daily range partitions or ~2160 hourly range partitions
> per transaction table in Aurora PostgreSQL. It would be helpful if you
> could explain what exact performance impact you faced with partitioning
> in Aurora PostgreSQL.
>
> *"Use ora2pg to export views in the Oracle database. It's very easy; a
> small EC2 VM running Linux with enough disk space lets you automate the
> extraction from Oracle and importation into AWS Postgresql.)"*
>
> Need to explore a bit more on this I believe. We have an oracle on premise
> database, so we can move data directly to aurora postgresql in the cloud.
> Another thing , is we have some sample data available in the AWS snowflake
> but then not sure if some mechanism is available to move the same data to
> the aurora postgresql ?
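
For the Snowflake sample data, the plain route is to export it to CSV in
S3 and load it with COPY (RDS/Aurora PostgreSQL also ship an aws_s3
extension for importing straight from a bucket). A rough sketch using
psql, where the table and file names are placeholders:

  -- \copy runs COPY through the client, so the file only needs to be
  -- readable from the machine running psql
  \copy txn_sample FROM 'txn_sample.csv' WITH (FORMAT csv, HEADER true)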
>
> On Sun, 10 Dec 2023 at 02:27, Ron Johnson <ronljohnsonjr(at)gmail(dot)com> wrote:
>
>> I don't know anything about Aurora, only have experience with RDS
>> Postgresql.
>>
>> We successfully migrated from on-prem Oracle (12c, I think) to RDS
>> Postgresql 12, and were very happy: little downtime (I take pride in
>> having designed/implemented that; thanks, ora2pg!), with lower disk
>> usage (8TB down to 5TB) and lower CPU usage.
>>
>> I'm not sure what the TPS was in Oracle, but the server level "we" are
>> on (I'm no longer with that company, and don't remember the specifics:
>> 48 vCPU / 132 GB RAM, I think, with 10,000 IOPS) is over-powered for
>> *our* needs.
>>
>> You're going to have to spin up a full-sized instance, import a *lot* of
>> real data(*) into a production-identical schema and then run your batch
>> load process using test data (or copies of real batches). That's the only
>> way you'll *really* know.
>>
>> Beware of excessive partitioning. We had to "departition" most tables,
>> because of the impact on performance.
>>
>> (*) Use ora2pg to export views in the Oracle database. It's *very*
>> easy; a small EC2 VM running Linux with enough disk space lets you
>> automate the extraction from Oracle and importation into AWS
>> Postgresql.
>>
>> On Sat, Dec 9, 2023 at 3:36 PM veem v <veema0000(at)gmail(dot)com> wrote:
>>
>>> Thank you so much for the response.
>>>
>>> Got your point, will check if we really need details or summary for the
>>> historical data. But it looks like we will need detailed transaction data
>>> for ~2 years at least.
>>>
>>> My understanding was that AWS has two different offerings and "aurora
>>> postgresql" is more performant and costlier than "RDS postgresql". Will
>>> double check on this though.
>>>
>>> However, how do we prove that RDS/Aurora PostgreSQL is going to serve
>>> our OLTP requirement here, similar to the current on-premise Oracle
>>> Exadata? For the OLTP use case we are expecting ~15K TPS write and ~2K
>>> TPS read, and the UI queries are expected to respond within a
>>> subsecond. But yes, as I mentioned, individual transactions will be
>>> batched and then written to the database, so this should create less
>>> resource consumption and contention.
>>>
>>> To test whether Aurora PostgreSQL can cater to the above needs (in
>>> regards to the expected performance at a nominal cost), how should we
>>> test it? As we won't be able to test everything right away, should we
>>> test basic read and write performance, benchmark that to gain some
>>> confidence, and then go ahead with development?
>>>
>>> Say, for example, one transaction consists of ~8 inserts. We can
>>> create a sample target table on Aurora PostgreSQL with the required
>>> indexes/constraints and try running those inserts from multiple
>>> threads (for concurrency) using BlazeMeter, and see/compare the
>>> response time, CPU, I/O, and memory usage of the PostgreSQL database
>>> at the set TPS. Similarly, to see read performance, we can run
>>> multiple SELECT queries from BlazeMeter and compare the response
>>> times. Is this the correct approach for validating the database here,
>>> or does some other approach exist?
>>>
>>> And another question that comes to mind: I have read in the past that
>>> VACUUM can be a problem in PostgreSQL. Is it going to give trouble in
>>> Aurora PostgreSQL too, for such a highly transactional read/write
>>> system? How do we test/validate that?
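
Autovacuum is there in Aurora PostgreSQL as well; the usual way to see
whether it keeps up during a load test is to watch dead-tuple counts and
autovacuum activity. A small sketch:

  -- tables with the most dead tuples, plus when autovacuum last ran
  SELECT relname,
         n_live_tup,
         n_dead_tup,
         last_autovacuum,
         autovacuum_count
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC
  LIMIT 20;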
>>>
>>> On Sun, 10 Dec 2023 at 01:29, Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
>>> wrote:
>>>
>>>> On Sat, Dec 9, 2023 at 2:13 PM veem v <veema0000(at)gmail(dot)com> wrote:
>>>>
>>>>>
>>>>> Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
>>>>> wrote:
>>>>>
>>>>>> "OK" is relative, but it's what we did in a similar situation: two
>>>>>> years of data on-line and 5 years of data in compressed files in S3.
>>>>>> (We're required to keep data for 7 years, but they *never* ask for
>>>>>> records more than 2 years old. If they ever do, we'll manually load the
>>>>>> relevant data back into PG.)
>>>>>> (I can't imagine that querying billions of unindexed flat-file
>>>>>> records via S3 would be fast.)
>>>>>> How often do end users look for data more than 90 days old? Two
>>>>>> years old?
>>>>>> How quickly does the old data need to be available?
>>>>>> Maybe a four-tiered system of PG-Snowflake-S3-S3_Glacier would
>>>>>> balance speed and cost (or maybe Snowflake would just drive up costs).
>>>>>
>>>>>
>>>>> Thank You so much Ron.
>>>>>
>>>>> When you said "*two years of data on-line and 5 years of data in
>>>>> compressed files in S3*", do you mean two years of data in Aurora
>>>>> PostgreSQL, which will be the OLTP database, and the rest just dumped
>>>>> as-is into an S3 bucket, and when we need to query that data we simply
>>>>> read those S3 files and load them back into Aurora PostgreSQL?
>>>>>
>>>>
>>>> (RDS Postgresql, not Aurora, but that's beside the point.)
>>>>
>>>> Yes.
>>>>
>>>> But keep reading...
>>>>
>>>>> The currently running Oracle Exadata system holds SIX months of data
>>>>> (which is ~200TB), and the transaction tables are all range-partitioned
>>>>> on a daily basis. Out of that, ~2 months of data gets queried
>>>>> frequently and the other ~4 months less frequently. However, in the
>>>>> target architecture we want to build in the cloud, there are
>>>>> requirements for the analytics/data science team to query ~3 years of
>>>>> history. Beyond ~3 years we may need that data only rarely.
>>>>>
>>>>> We were initially thinking of having just one database serve both the
>>>>> OLTP and OLAP use cases (somewhat like Oracle Exadata currently does
>>>>> for us), but it seems we don't have such an option here on AWS.
>>>>> PostgreSQL will serve the OLTP use case, whereas Snowflake will serve
>>>>> the OLAP use case.
>>>>>
>>>>> So do you suggest having both databases in use: recent transaction
>>>>> data for the last 3 months streamed to Aurora PostgreSQL, data from 3
>>>>> months to 3 years parked in Snowflake to serve the OLAP/analytics use
>>>>> case, and data from 3 years to 10 years kept in S3 (in Parquet or
>>>>> Iceberg format) so that Snowflake can query it directly when needed?
>>>>>
>>>>> OR
>>>>>
>>>>> Do you suggest just keeping the last ~3 months of data in Aurora
>>>>> PostgreSQL and everything else in Snowflake, considering that it
>>>>> stores data in a compressed format and storage is cheap ($23 per TB
>>>>> per month)?
>>>>>
>>>>> A few colleagues are pointing to Databricks for the analytics use
>>>>> case. Is that a good option here?
>>>>>
>>>>>
>>>> I can't answer that without knowing what the end users actually need
>>>> (details, or just summaries of historical data, in different tiers).
>>>>
>>>> You all will have to do the cost:benefit analysis of different
>>>> architectures.
>>>>
>>>>
>>>>>
>>>>> On Sat, 9 Dec 2023 at 16:43, veem v <veema0000(at)gmail(dot)com> wrote:
>>>>>
>>>>>> Hello All,
>>>>>> Although it's not exactly related to open-source PostgreSQL, I want
>>>>>> to ask this question here to get the views of colleagues with decades
>>>>>> of experience in the database world. We would like some guidance on
>>>>>> whether the design below looks okay for our customer use case.
>>>>>>
>>>>>> We currently have financial-systems transaction data streaming to
>>>>>> Oracle Exadata (X9) on-premise. This database supports processing of
>>>>>> 400 million transactions per day. A single transaction for us is a
>>>>>> combination of 7-8 inserts into different transaction tables, with
>>>>>> indexes, unique constraints, etc. defined on those. The transactions
>>>>>> are processed/committed in batches (~1000 batch size) in the
>>>>>> database, and this system persists data for ~6 months. We also have
>>>>>> all sorts of OLAP applications (daily/monthly batch reports) running
>>>>>> on the same database, along with some user-facing UI applications
>>>>>> showing customer transactions. So it's currently serving a hybrid
>>>>>> workload and is a one-stop solution for all use cases.
>>>>>>
>>>>>> Many of the applications are moving from on-premise to the AWS cloud
>>>>>> as part of a modernization journey, with AWS as the chosen cloud
>>>>>> partner. The product is also expected to expand across more regions,
>>>>>> and this system is expected to serve an increase in transaction
>>>>>> volume. In addition, we have a requirement to persist transaction
>>>>>> data for ~10 years to have it available for analytics/data science
>>>>>> use cases.
>>>>>>
>>>>>> So the team is thinking of splitting it into two parts:
>>>>>> 1) An OLTP-type use case in which we persist/write the transaction
>>>>>> data quickly and show it to the UI-related apps in near real time, or
>>>>>> as quickly as possible. This database will store at most 60-90 days
>>>>>> of transaction data. We are not sure there is an Oracle Exadata
>>>>>> equivalent on AWS, so the team is planning to use/experiment with
>>>>>> Aurora PostgreSQL. Please correct me if there are other options we
>>>>>> should use instead.
>>>>>>
>>>>>> 2) Then move the data beyond ~90 days into another database or into
>>>>>> S3 object storage, which will keep it for ~10 years and make it
>>>>>> queryable through the necessary APIs. That is supposed to cater to
>>>>>> the OLAP/analytics/data science use cases, etc.
>>>>>>
>>>>>> Is the above design okay? Also, in regards to the second point above,
>>>>>> i.e. persisting the historical data (in a queryable state), should we
>>>>>> go for a database like Snowflake, or should we just keep it on S3 as
>>>>>> is and make it queryable through APIs? Please advise.
>>>>>>
>>>>>
