How should we design our tables and indexes

From: veem v <veema0000(at)gmail(dot)com>
To: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: How should we design our tables and indexes
Date: 2024-02-10 19:38:36
Message-ID: CAB+=1TXnsSW384gsJxOFvzJt=+coUWh5my39mkHwZEEWPFCsWQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello,
We want to have the response time in <1 sec for our UI search query
requirement. These will be pagination queries. These read queries will be
on big transaction tables (will have ~500+ attributes approx will have
approx. rows size of ~1KB) having a continuous stream of inserts consumed
from source. And these tables will be a daily range partitioned on the
processing_date column. Each partition is going to hold approx ~450million
rows when it will serve in full capacity to all the customers transactions.

The customer will be given the capability to search on a Max date range of
30 days of transaction data i.e ~30 range partitions and are supposed to
get the query response back in <1 sec as it will be UI from which those
queries will be submitted.

1)Is there any way in postgres to influence the optimizer for the
"first_row" optimization, so that it won't go for evaluating all the rows
from such UI search queries. As because these will be pagination queries
and the user will be interested in seeing top 100 transactions in the first
page asap?

2) For e.g below is one sample query from a similar system but in a
different database. Want to understand , what would be the appropriate
indexes to make this above search query run in the quickest possible time?

one Index on table1(MID) , one index Table1(CID), one index on
table2(ACN_NBR)?
OR
Should we create a composite index here combining PR_ID i.e (PR_ID, MID),
(PR_ID, CID), (PR_ID, ACN_NBR) as that is the most unique attribute here?

select count(*) over() as total_record, *
from
(select .......
from TABLE1
Left join schema1.TABLE2 on TABLE2.PR_ID = TABLE1.PR_ID and
TABLE2.MID = TABLE1.MID
and TABLE2.processing_date=TABLE1.processing_date
where TABLE2.processing_date between '2023-04-20' and
'2023-05-21'-- Partition pruning
and TABLE2.ACN_NBR = 'XXXX'
and ( TABLE1.MID in (XXXXXX) OR TABLE1.CID in (XXXXXX))
order by TABLE1.PR_TIME DESC
)
limit 100 offset 0;

The data pattern for the columns used in predicate are as below:- Table1
will be the driving table.

count(distinct ACN_NBR) - 25million
count(distinct MID) - 223k
count(distinct CID) - 59k
count(*)from table1 and table2- ~350 million
PR_ID is a unique key.

3)One of the use cases is that the customer should be able to search on
certain attributes and should be able to see the transactions in "desc by
processing_date" i.e. latest transactions on the first page on the UI. And
in such scenario, if the search attribute is less unique and the customer
puts a date range of a month i.e. over 30 partitions , it may results in
scanning and sorting billions of rows to get the top/recent ~100
transactions and most likely not going to respond back in <1 sec, even goes
for the index. So how should we handle or design indexes for catering such
queries? For e.g. if we have the only filter on column "TABLE1.CID" in the
above query, which is very less unique then how to handle it?

4)And finally, the parameter "settings" which we have currently is as
below for the current system. So I wanted to understand, if they are
accurate or any oddity and we should change those to cater such
requirements?

For e.g. if we set "max_parallel_workers_per_gather"=4 to speed up the
queries, then we will be able to serve only 32/4=8 concurrent user requests
at any point in time. If we are targeting to serve ~100 concurrent users ,
will it be advisable to change or we should test the system with default
i.e. not setting this parallel parameter?

*Settings: *
effective_cache_size = '176237472kB', maintenance_io_concurrency = '1',
max_parallel_workers = '32', max_parallel_workers_per_gather = '4',
search_path = 'public, public, "$user"', temp_buffers = '16MB', work_mem =
'2GB', enable_partitionwise_join = 'on'

Regards
Veem

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Peter J. Holzer 2024-02-11 00:25:28 Re: How to do faster DML
Previous Message Marc Millas 2024-02-09 15:29:13 Re: Partitioning options