Re: How to do faster DML

From: Lok P <loknath(dot)73(at)gmail(dot)com>
To: Greg Sabino Mullane <htamfids(at)gmail(dot)com>
Cc: dwhite(at)seawardmoon(dot)com, Marcos Pegoraro <marcos(at)f10(dot)com(dot)br>, pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>, Francisco Olarte <folarte(at)peoplecall(dot)com>, haramrae(at)gmail(dot)com, hjp-pgsql(at)hjp(dot)at
Subject: Re: How to do faster DML
Date: 2024-02-05 11:21:08
Message-ID: CAKna9VY=trErRVWKGSR4DPH0UqE4+wjD5+ZJp35vA3qctT_dug@mail.gmail.com
Lists: pgsql-general

On Mon, 5 Feb, 2024, 2:21 am Lok P, <loknath(dot)73(at)gmail(dot)com> wrote:

On Sun, Feb 4, 2024 at 9:18 PM Greg Sabino Mullane <htamfids(at)gmail(dot)com>
wrote:

Partitioning will depend a lot on how you access the data. Is it always
using that transaction_date column in the where clause? Can you share the
table definition via \d? We've talked about this table quite a bit, but not
actually seen what it looks like. We are flying blind a little bit. You
mentioned your queries are slow, but can you share an explain analyze on
one of these slow queries?

45 billion rows is quite manageable. How many rows are in each day? You may
want to do a larger bucket than 24 hours per partition.

1) Even after partitioning the target table, is there an option to disable the primary and foreign keys to speed up the data load, and re-enable them once the load finishes? Would that be a good idea?

No.

3) As the table, and each partition, is going to be very large and this will be a read-intensive application, compressing the historical partitions will help us save storage space and will also help read-query performance.

I am not sure what this means. If you are not going to need the data
anymore, dump the data to deep storage and drop the partition.

Cheers,
Greg

Thank you.

The table has ~127 columns of different data types (a mix of numeric, varchar, date, etc.). Its current size is ~1TB, holding ~3 billion rows, and the average row size is ~300 bytes.

Current volumes are lower, but in future the daily transactions inserted into this table will peak at ~500 million rows/day. The plan is to retain at least ~3 months of transaction data, which will be around 45 billion rows in total. If all goes well, we may later need to retain ~6 months' worth of data in this database, which will be ~90 billion rows.

This table will always be queried with the transaction_date column as one of the filter criteria, but the query/search window can span anywhere from a single day to a month of transaction dates.
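
For example, a typical query under that pattern would look something like the sketch below (column names are placeholders for our real ones); my understanding is that with range partitioning on transaction_date the planner should prune the scan down to just the partitions overlapping the requested window:

  SELECT txn_id, amount, transaction_date
  FROM   transactions
  WHERE  transaction_date >= DATE '2024-01-01'
  AND    transaction_date <  DATE '2024-02-01'  -- up to a month-wide window
  AND    account_id = 12345;                    -- hypothetical additional filter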

When you said "*You may want to do a larger bucket than 24 hours per partition*", do you mean partitioning by week or so? By the current math (~1TB of storage for ~3 billion rows, i.e. roughly 300 bytes per row), a daily range partition holding ~500 million transactions/day will be around ~100-150GB. Won't that be too much data for a single partition to operate on, and won't making the buckets even coarser (say weekly) make each partition bulkier still?
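
Just to be sure we mean the same thing, by "weekly" I am picturing something like the sketch below (table and column names are placeholders; the primary key has to include the partition key column):

  CREATE TABLE transactions (
      txn_id            bigint      NOT NULL,
      transaction_date  timestamptz NOT NULL,
      amount            numeric,
      -- ... remaining ~120 columns ...
      PRIMARY KEY (txn_id, transaction_date)
  ) PARTITION BY RANGE (transaction_date);

  -- one partition per week instead of per day
  CREATE TABLE transactions_2024_w06 PARTITION OF transactions
      FOR VALUES FROM ('2024-02-05') TO ('2024-02-12');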

What I meant was: the users will definitely still need the data for querying, but to keep the storage space under control (and to help the read queries) we were thinking of compressing the historical partitions. Is there any compression strategy we should follow for that on Postgres?
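
In case it matters, the only built-in, column-level option I am aware of is TOAST compression (lz4 is available from PostgreSQL 14 onwards), and it only kicks in for values large enough to be toasted, so with ~300-byte rows it may not buy much. A minimal sketch, assuming a historical partition with a wide text column named payload (both names are placeholders):

  -- new/updated values in this column are compressed with lz4;
  -- existing values are not rewritten
  ALTER TABLE transactions_2024_w01
      ALTER COLUMN payload SET COMPRESSION lz4;

Beyond that, I take "dump the data to deep storage and drop the partition" to mean roughly:

  -- detach the oldest partition so it becomes a plain table
  ALTER TABLE transactions DETACH PARTITION transactions_2024_w01;
  -- archive it outside the database (e.g. pg_dump -Fc -t transactions_2024_w01),
  -- then remove it
  DROP TABLE transactions_2024_w01;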

With regard to loading data into the table faster, I am wondering why you said "No" to the approach of loading the data first and creating/enabling the primary key and foreign key constraints afterwards. This table is a child table and the parent already has data in it, so loading this table with the PK and FK in place is very slow, since they are validated for every batch of rows. We were therefore thinking that validating them in one shot at a later stage would be faster. Please suggest.
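
For concreteness, what we had in mind is roughly the sketch below (constraint, column, and table names are placeholders):

  -- before the bulk load: drop the constraints on the child table
  ALTER TABLE transactions DROP CONSTRAINT transactions_parent_fk;
  ALTER TABLE transactions DROP CONSTRAINT transactions_pkey;

  -- ... bulk load the data ...

  -- after the load: re-create them in one shot
  ALTER TABLE transactions
      ADD CONSTRAINT transactions_pkey PRIMARY KEY (txn_id, transaction_date);
  ALTER TABLE transactions
      ADD CONSTRAINT transactions_parent_fk FOREIGN KEY (parent_id)
      REFERENCES parent_table (parent_id);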

I will try to collect some of the SELECT queries and post their EXPLAIN ANALYZE output. Currently we are trying to get rid of the duplicates.
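
When I collect them, I plan to capture the plans along these lines (placeholder query; I understand BUFFERS is worth including):

  EXPLAIN (ANALYZE, BUFFERS)
  SELECT txn_id, amount
  FROM   transactions
  WHERE  transaction_date >= DATE '2024-01-01'
  AND    transaction_date <  DATE '2024-01-08';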

Regards
Lok

Any thoughts, based on the above usage pattern?

I did the math based on the current stats with a limited data set, and the partition size comes out to 100-150GB per day if we keep daily range partitions, as explained above. Should we make the partitions hourly instead?

So I would like some expert views on whether this is going to work well for a read-latency-sensitive application, or whether we should adopt a different strategy.
