600 million rows of data. Bad hardware or need partitioning?

From: Arya F <arya6000(at)gmail(dot)com>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: 600 million rows of data. Bad hardware or need partitioning?
Date: 2020-05-04 03:27:54
Message-ID: CAFoK1ay01_ND0QrCcAAHX7U=rTCssB8vfOiH+iv5b8Qh6WN80g@mail.gmail.com
Lists: pgsql-performance

I have created the following table to reproduce my performance
numbers; I have simplified it for this question.

I'm running PostgreSQL 12 on the following hardware.

Dual Xeon Quad-Core E5320 1.86GHz
4GB of RAM

The table structure is

id uuid
address_api_url text
check_timestamp timestamp with time zone
address text
Indexes:
"new_table_pkey" PRIMARY KEY, btree (id)
"test_table_check_timestamp_idx" btree (check_timestamp)

Right now the table has 100 million rows, but I expect it to reach
about 600-700 million. I am faced with slow updates/inserts, and the
issue is caused by the indexes, since they have to be updated on each
insert/update. If I remove the indexes, insert performance remains
excellent even with millions of rows.
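
For illustration, dropping the indexes before a bulk load and recreating them
afterwards would look roughly like this (a sketch only; the bulk load itself
is elided):

-- Sketch: drop the secondary index and primary key, bulk load, then rebuild.
DROP INDEX test_table_check_timestamp_idx;
ALTER TABLE test_table DROP CONSTRAINT new_table_pkey;

-- ... bulk INSERT / COPY into test_table here ...

ALTER TABLE test_table ADD CONSTRAINT new_table_pkey PRIMARY KEY (id);
CREATE INDEX test_table_check_timestamp_idx ON test_table (check_timestamp);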

To demonstrate the update performance, I have constructed the following
query, which updates the timestamp of 10,000 rows:

UPDATE test_table SET check_timestamp = now()
FROM (SELECT id FROM test_table LIMIT 10000) AS subquery
WHERE test_table.id = subquery.id;

That update took about 1 minute and 44 seconds
Time: 104254.392 ms (01:44.254)

Below is the EXPLAIN ANALYZE output:

EXPLAIN ANALYZE UPDATE test_table SET check_timestamp = now()
FROM (SELECT id FROM test_table LIMIT 10000) AS subquery
WHERE test_table.id = subquery.id;

                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Update on test_table  (cost=0.57..28234.86 rows=10000 width=160) (actual time=102081.905..102081.905 rows=0 loops=1)
   ->  Nested Loop  (cost=0.57..28234.86 rows=10000 width=160) (actual time=32.286..101678.652 rows=10000 loops=1)
         ->  Subquery Scan on subquery  (cost=0.00..514.96 rows=10000 width=56) (actual time=0.048..45.127 rows=10000 loops=1)
               ->  Limit  (cost=0.00..414.96 rows=10000 width=16) (actual time=0.042..26.319 rows=10000 loops=1)
                     ->  Seq Scan on test_table test_table_1  (cost=0.00..4199520.04 rows=101204004 width=16) (actual time=0.040..21.542 rows=10000 loops=1)
         ->  Index Scan using new_table_pkey on test_table  (cost=0.57..2.77 rows=1 width=92) (actual time=10.160..10.160 rows=1 loops=10000)
               Index Cond: (id = subquery.id)
 Planning Time: 0.319 ms
 Execution Time: 102081.967 ms
(9 rows)

Time: 102122.421 ms (01:42.122)

With the right hardware, can a single unpartitioned table handle 600
million rows with good insert/update performance? If so, what kind of
hardware should I be looking at? Or would I need to create partitions?
I'd like to hear some recommendations.
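
For reference, declarative range partitioning on check_timestamp in
PostgreSQL 12 would look roughly like the sketch below (the partition column,
the monthly ranges, and the table names are assumptions chosen for
illustration; note that the primary key of a partitioned table must include
the partition column):

-- Sketch of declarative range partitioning; names and ranges are illustrative.
CREATE TABLE test_table_part (
    id                uuid NOT NULL,
    address_api_url   text,
    check_timestamp   timestamp with time zone NOT NULL,
    address           text,
    PRIMARY KEY (id, check_timestamp)
) PARTITION BY RANGE (check_timestamp);

CREATE TABLE test_table_part_2020_05 PARTITION OF test_table_part
    FOR VALUES FROM ('2020-05-01') TO ('2020-06-01');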
