Improving performance of merging data between tables

From: Pawel Veselov <pawel(dot)veselov(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Improving performance of merging data between tables
Date: 2014-12-28 21:49:59
Message-ID: CAMnJ+BeeN-J3d+b9UPfDLkN0B=uWnFEXZkdON5XRB8EBnLZvUw@mail.gmail.com

Hi.

I was wondering if anybody has ideas on how to improve certain
operations we are running.

PostgreSQL 9.3.3. Table layout:

main_table: PK, N key columns, M data columns. The key columns together form
a unique key. The PK is a pure sequence number. There are a few separate
non-unique indexes on some of the key columns to aid in querying the table
data.

second_table: PK, main_table_PK_REF (declared as a foreign key), Na key
columns, Ma data columns. There are 0-inf (typically 0-10) entries in
second_table that reference a single entry in main_table. The PK is a pure
sequence number, and the unique key for the table is (main_table_PK_REF + key
columns). The only non-obvious thing here is that some of the unique
indexes involve the coalesce() function:

"uq_brkioevent" UNIQUE, btree (mainid, blockid, name, subid,
(COALESCE(cname, ''::character varying)))

Those are there because we need to write NULLs into those columns, but a
unique index treats NULLs as distinct values, so without the coalesce() it
would not catch duplicates (AFAIR).
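
For illustration, a stripped-down version of the layout (column names like
key_a, key_b, data_a are made up here; the real tables have N/Na key columns
and M/Ma data columns):

CREATE TABLE main_table (
    pk     serial PRIMARY KEY,
    key_a  integer NOT NULL,
    key_b  integer NOT NULL,
    data_a bigint,
    UNIQUE (key_a, key_b)
);

CREATE TABLE second_table (
    pk      serial PRIMARY KEY,
    mainid  integer NOT NULL REFERENCES main_table (pk),
    blockid integer NOT NULL,
    name    text NOT NULL,
    subid   integer NOT NULL,
    cname   character varying,  -- may be NULL
    data_a  bigint
);

-- coalesce() folds NULL cname values into '' so they take part in the
-- uniqueness check like any other value
CREATE UNIQUE INDEX uq_brkioevent ON second_table
    (mainid, blockid, name, subid, (COALESCE(cname, ''::character varying)));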

There is one main_table, and 10 second_tables.

The main table currently has ~1.1M rows, and the second tables have about
1M-2M rows each. The main table grows by a fixed number of rows (~10) per
hour.

Multiple (web) application nodes need to write data into this table. Most
of the operations modify the data columns rather than insert new data. We had
serious contention problems when we let all the application nodes write
directly into the table. The writes involved prepared transactions; a
prepared transaction can only be reaped after other data stores are written
to, and there is a very high chance that nodes need to modify the same rows,
so locks were being held too long.

To relieve the contention, we allocated each application node its own set of
tables that are structurally identical to the main/second tables. An
application node opens a transaction, writes all of its data into its own
tables, free of locks, and then calls a pgsql function to merge the data
from its tables into the common tables. There is typically relatively
little data in the node tables (say, within 100 rows in any table) before it
is merged into the common tables. Nodes dump their data whenever there is
something to dump, which can be a few times a second.
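
Schematically, one dump from a node looks something like this (using the
made-up column names from above; main_node and second_node are the
node-local copies of the common tables):

BEGIN;
-- the node writes only into its own tables, so there is no cross-node locking
INSERT INTO main_node (key_a, key_b, data_a) VALUES (42, 7, 100);
INSERT INTO second_node (mainid, blockid, name, subid, cname, data_a)
SELECT pk, 3, 'click', 1, NULL, 5 FROM main_node WHERE key_a = 42 AND key_b = 7;
-- a single function call folds everything into the common tables
SELECT merge_all();
COMMIT;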

Recently, the operation that moves the data from the node tables into the
common tables started being a real drain on the PostgreSQL server CPU. I
assume this is probably due to the data set reaching some critical mass.
Things really got out of hand when we had to double the number of
application nodes to accommodate a surge in application use.

The merging is done in the following manner.
Each main/second table has an associated PL/pgSQL function (merge_xxx) that
takes key and data values as arguments. In an endless loop, it first tries to
execute an UPDATE statement (using math operations to combine the existing
and input data values, with the key data in the WHERE clause). If the UPDATE
statement sets "found", the function exits. Otherwise, the function tries to
INSERT with the key/data values. If that succeeds, the function exits; if a
unique_violation is thrown, the loop continues.
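
A simplified sketch of one such merge function, using the made-up main_table
columns from above (the real functions take the full sets of key and data
columns; in this sketch the function returns the row's PK in the common
table, which merge_all needs in order to translate the node-local foreign
keys):

CREATE OR REPLACE FUNCTION merge_main(p_key_a integer, p_key_b integer,
                                      p_data_a bigint)
RETURNS integer AS $$
DECLARE
    v_id integer;
BEGIN
    LOOP
        -- first, try to update an existing row, folding the new data in
        UPDATE main_table
           SET data_a = data_a + p_data_a
         WHERE key_a = p_key_a AND key_b = p_key_b
        RETURNING pk INTO v_id;
        IF found THEN
            RETURN v_id;
        END IF;
        -- no such row yet: try to insert it; if another session inserts the
        -- same key first, the unique violation sends us back to retry the UPDATE
        BEGIN
            INSERT INTO main_table (key_a, key_b, data_a)
            VALUES (p_key_a, p_key_b, p_data_a)
            RETURNING pk INTO v_id;
            RETURN v_id;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- loop and try the UPDATE again
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;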

On top of these individual functions, there is another PL/pgSQL function
(merge_all). It uses a "for ROW in select * from MAIN_NODE" outer loop;
within that loop it calls merge_xxx for the main table, and then, for each
secondary table, runs a similar "for ROWx in select * from SECOND_NODE"
loop, with a WHERE clause that only picks up entries corresponding to the
main_node entry currently being processed, calling merge_xxx for the
corresponding secondary table. At the end of the outer loop, all data from
the node tables is removed (using DELETE). I will gladly provide
pseudo-code, or even the function body, if my explanation is unclear.
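
A rough sketch of merge_all itself, again with the made-up names and only
one second table shown instead of ten (merge_second follows the same
UPDATE/INSERT pattern as merge_main):

CREATE OR REPLACE FUNCTION merge_all() RETURNS void AS $$
DECLARE
    r         record;
    r2        record;
    v_main_id integer;
BEGIN
    FOR r IN SELECT * FROM main_node LOOP
        -- merge the main row and capture the PK it has in the common table
        v_main_id := merge_main(r.key_a, r.key_b, r.data_a);

        -- merge the matching node-local second-table rows, translating the
        -- node-local main PK into the common one
        FOR r2 IN SELECT * FROM second_node WHERE mainid = r.pk LOOP
            PERFORM merge_second(v_main_id, r2.blockid, r2.name, r2.subid,
                                 r2.cname, r2.data_a);
        END LOOP;
    END LOOP;

    -- once everything has been merged, empty the node tables
    DELETE FROM second_node;
    DELETE FROM main_node;
END;
$$ LANGUAGE plpgsql;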

Besides "can somebody please look at this and let me know if I'm doing
something utterly stupid", here are my questions.

1) How do I find out what exactly is consuming the CPU in a PL/pgSQL
function? All I can see is that the calls to the merge_all() function take a
long time, and the CPU usage is high while this is going on.

2) Is there a better way to merge individual rows, other than doing
UPDATE/INSERT in a loop, and would it be less CPU expensive?

3) Is there a better way to merge whole tables? Note, however, that I need
to translate primary keys from the node main table into the common main
table, as they are used as foreign keys, hence the loops. I suspect the
looping is CPU intensive.

Thank you,
Pawel.
