json_to_recordset() and CTE performance

From: Matt DeLuco <matt(at)deluco(dot)net>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: json_to_recordset() and CTE performance
Date: 2020-10-21 14:11:32
Message-ID: 3379A61C-D9AC-43F0-81C5-F84717C06774@deluco.net
Lists: pgsql-general

Hello,

I’ve written an SQL function that takes a single JSON parameter, in this case an array of objects, each with eight properties (example below). The function uses json_to_recordset() in a CTE to insert three rows across two tables for each record. It takes nearly 7 minutes to insert my dataset of 8935 records (as JSON), driven by a small Python script (the script itself parses the JSON in < 1s), on PostgreSQL 13 on my MacBook Air.
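
For anyone unfamiliar with json_to_recordset(), here is a minimal standalone example of the pattern the function below uses, with the column definition list trimmed to three of the eight properties:

SELECT x.*
FROM json_to_recordset('[
    {"bank_account_id": 1324, "transaction_id": "abc123", "transaction_date": "2020-10-20"}
]'::json)
AS x (
    bank_account_id  BIGINT,
    transaction_id   TEXT,
    transaction_date DATE
);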

As an experiment I wrote a second SQL function with eight parameters, one for each property of the JSON objects, which requires a discrete function call per record. I then wrote a Python script that inserts all 8935 records that way (making as many calls to the SQL function); it executes in around 2 minutes.
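
In that version each record becomes its own call, along these lines (the function name and values here are just placeholders, not the real ones):

SELECT journal.create_one_entry(        -- hypothetical per-record variant
    1324,                               -- bank_account_id
    'abc123',                           -- transaction_id
    DATE '2020-10-20',                  -- transaction_date
    DATE '2020-10-21',                  -- posted_date
    12.34,                              -- amount
    'example description',              -- description
    NULL::BIGINT,                       -- parent_account_id
    NULL::BIGINT                        -- child_account_id
);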

I’m very much a novice at interpreting EXPLAIN (ANALYZE) and hope someone can help me better optimize my original function. Both the function and the EXPLAIN ANALYZE output are provided below. Is this perhaps a limitation of CTEs or json_to_recordset(), so that an entirely different approach is necessary (like the second function I wrote, with one call per record)? I was really hoping to make this work: I’ve written a small API to my database using SQL and PL/pgSQL functions, each taking a single JSON parameter, so that my web backend acts almost as an HTTP proxy to the database. It’s a different way of doing things as far as webdev goes, but (for me) an interesting experiment. I like PostgreSQL and really want to take advantage of it.

I’ve tried a few “optimizations”. Removing the final SELECT and returning VOID saves around 1.5 minutes. Removing some extra JOINs saves a little time, but nothing substantial (the joins against account.t_account in some places are there to check “ownership” of the record by a given client_id). A sketch of the VOID variant follows.
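
Since you can’t simply delete the final SELECT from a WITH, the VOID variant makes the second INSERT the main statement instead. A minimal sketch of that shape, on throwaway tables t1/t2 invented just for this example:

CREATE TABLE t1 (id BIGINT);
CREATE TABLE t2 (id BIGINT);

CREATE FUNCTION insert_both(in_json JSON)
RETURNS VOID
AS $$
    WITH data AS (
        SELECT x.id
        FROM json_to_recordset(in_json) AS x (id BIGINT)
    ),
    ins1 AS (
        INSERT INTO t1 SELECT id FROM data
    )
    -- the second INSERT is now the main statement; nothing is returned
    INSERT INTO t2 SELECT id FROM data;
$$ LANGUAGE sql;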

Using a subquery seems like it could be much faster than a CTE, but I don’t know how to insert three rows across two tables using a subquery.
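
To show what I mean, the single-table form is easy, with json_to_recordset() sitting directly in the INSERT ... SELECT and no CTE at all (using a throwaway table t1, as in the sketch above); it’s reaching the second table that I can’t see how to do:

INSERT INTO t1 (id)
SELECT x.id
FROM json_to_recordset('[{"id": 1}, {"id": 2}]'::json) AS x (id BIGINT);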

Any advice is appreciated, thank you in advance.

Matt

INPUT:
(The column definition list in the function’s CTE below shows all eight properties and their types.)

[
    {
        "bank_account_id": 1324,
        "transaction_id": "abc123",
        "transaction_date": "2020-10-20",
        ...
    },
    ...
]

OUTPUT:
(Not sure what I’ve done to create the nested arrays, but it’s unnecessary;
I can shave off ~1.5 min by returning VOID.)

[
    [
        {
            "id": 250185
        },
        {
            "id": 250186
        },
        ...
    ]
]

FUNCTION:

CREATE OR REPLACE FUNCTION journal.create_with_categories(in_json JSON)
RETURNS JSON
AS $$

-- flatten the JSON array into typed rows, one per record
WITH data AS (
    SELECT
        (in_json#>>'{context, client_id}')::BIGINT AS client_id,
        nextval(pg_get_serial_sequence('journal', 'id')) AS journal_id,
        bank_transaction_type,
        x.*
    FROM json_to_recordset(in_json->'data')
        AS x (
            bank_account_id BIGINT,
            transaction_id TEXT,
            transaction_date DATE,
            posted_date DATE,
            amount finance.monetary,
            description TEXT,
            parent_account_id BIGINT,
            child_account_id BIGINT
        ),
        LATERAL bank_account.get_transaction_type_by_id(x.bank_account_id, x.amount) AS bank_transaction_type
),
-- one journal row per record
insert_journal_entry AS (
    INSERT INTO journal.journal (
        client_id, id, bank_account_id,
        transaction_id, transaction_date, posted_date,
        description
    )
    SELECT
        client_id, journal_id, bank_account_id,
        transaction_id, transaction_date, posted_date,
        description
    FROM data
),
-- two account2journal rows per record, one per side of the entry
insert_account_entries AS (
    INSERT INTO journal.account2journal (
        account_id, journal_id, amount, type
    )
    -- T account
    SELECT
        t.id,
        d.journal_id,
        @ d.amount,  -- @ is absolute value
        CASE WHEN d.bank_transaction_type = 'debit'::transaction_type
            THEN 'credit'::transaction_type
            ELSE 'debit'::transaction_type
        END
    FROM data d
    LEFT JOIN account.t_account t
        ON (t.id = COALESCE(d.child_account_id, d.parent_account_id))
    WHERE t.client_id = d.client_id OR t.id IS NULL

    UNION ALL

    -- bank account
    SELECT
        t.id, d.journal_id, @ d.amount, d.bank_transaction_type
    FROM data d
    JOIN bank_account.bank_account b
        ON (b.id = d.bank_account_id)
    JOIN account.t_account t
        ON (t.id = b.t_account_id)
    WHERE
        t.client_id = d.client_id
)
SELECT json_agg(d) FROM (SELECT d.journal_id AS id FROM data AS d) AS d;

$$ LANGUAGE sql;
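
For reference, the whole batch is submitted as a single call. Based on the paths the function reads ('{context, client_id}' and ->'data'), the payload has this shape (the values here are illustrative, not from my real dataset):

SELECT journal.create_with_categories('{
    "context": {"client_id": 1},
    "data": [
        {"bank_account_id": 1324, "transaction_id": "abc123",
         "transaction_date": "2020-10-20", "posted_date": "2020-10-21",
         "amount": 12.34, "description": "example",
         "parent_account_id": null, "child_account_id": null}
    ]
}'::json);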

EXPLAIN ANALYZE:
(From logs)

Aggregate  (cost=24.24..24.25 rows=1 width=32) (actual time=388926.249..388926.371 rows=1 loops=1)
  Buffers: shared hit=53877 dirtied=2
  CTE data
    ->  Nested Loop  (cost=0.26..4.76 rows=100 width=148) (actual time=183.906..388716.550 rows=8935 loops=1)
          Buffers: shared hit=53877 dirtied=2
          ->  Function Scan on json_to_recordset x  (cost=0.01..1.00 rows=100 width=128) (actual time=130.645..142.316 rows=8935 loops=1)
          ->  Function Scan on get_transaction_type_by_id bank_transaction_type  (cost=0.25..0.26 rows=1 width=4) (actual time=0.154..0.156 rows=1 loops=8935)
                Buffers: shared hit=18054
  CTE insert_journal_entry
    ->  Insert on journal  (cost=0.00..2.00 rows=100 width=96) (actual time=453.563..453.563 rows=0 loops=1)
          Buffers: shared hit=79242 dirtied=295
          ->  CTE Scan on data  (cost=0.00..2.00 rows=100 width=96) (actual time=0.006..10.001 rows=8935 loops=1)
  CTE insert_account_entries
    ->  Insert on account2journal  (cost=4.86..15.23 rows=2 width=52) (actual time=816.381..816.381 rows=0 loops=1)
          Buffers: shared hit=159273 dirtied=335 written=17
          ->  Result  (cost=4.86..15.23 rows=2 width=52) (actual time=0.206..109.222 rows=17870 loops=1)
                Buffers: shared hit=5
                ->  Append  (cost=4.86..15.20 rows=2 width=52) (actual time=0.197..95.060 rows=17870 loops=1)
                      Buffers: shared hit=5
                      ->  Hash Left Join  (cost=4.86..7.14 rows=1 width=52) (actual time=0.195..35.512 rows=8935 loops=1)
                            Hash Cond: (COALESCE(d_1.child_account_id, d_1.parent_account_id) = t.id)
                            Filter: ((t.client_id = d_1.client_id) OR (t.id IS NULL))
                            Buffers: shared hit=2
                            ->  CTE Scan on data d_1  (cost=0.00..2.00 rows=100 width=68) (actual time=0.004..6.544 rows=8935 loops=1)
                            ->  Hash  (cost=3.27..3.27 rows=127 width=16) (actual time=0.137..0.137 rows=127 loops=1)
                                  Buffers: shared hit=2
                                  ->  Seq Scan on t_account t  (cost=0.00..3.27 rows=127 width=16) (actual time=0.026..0.073 rows=127 loops=1)
                                        Buffers: shared hit=2
                      ->  Hash Join  (cost=3.80..8.03 rows=1 width=52) (actual time=40.182..53.796 rows=8935 loops=1)
                            Hash Cond: ((t_1.id = b.t_account_id) AND (t_1.client_id = d_2.client_id))
                            Buffers: shared hit=3
                            ->  Seq Scan on t_account t_1  (cost=0.00..3.27 rows=127 width=16) (actual time=0.022..0.079 rows=127 loops=1)
                                  Buffers: shared hit=2
                            ->  Hash  (cost=3.59..3.59 rows=14 width=60) (actual time=40.118..40.118 rows=8935 loops=1)
                                  Buffers: shared hit=1
                                  ->  Hash Join  (cost=1.32..3.59 rows=14 width=60) (actual time=0.071..17.863 rows=8935 loops=1)
                                        Hash Cond: (d_2.bank_account_id = b.id)
                                        Buffers: shared hit=1
                                        ->  CTE Scan on data d_2  (cost=0.00..2.00 rows=100 width=60) (actual time=0.005..3.740 rows=8935 loops=1)
                                        ->  Hash  (cost=1.14..1.14 rows=14 width=16) (actual time=0.030..0.030 rows=14 loops=1)
                                              Buffers: shared hit=1
                                              ->  Seq Scan on bank_account b  (cost=0.00..1.14 rows=14 width=16) (actual time=0.012..0.016 rows=14 loops=1)
                                                    Buffers: shared hit=1
  ->  CTE Scan on data d  (cost=0.00..2.00 rows=100 width=8) (actual time=183.918..388812.950 rows=8935 loops=1)
        Buffers: shared hit=53877 dirtied=2
Trigger for constraint journal_client_id_fkey on journal: time=194.194 calls=8935
Trigger for constraint journal_bank_account_id_fkey on journal: time=204.014 calls=8935
Trigger trigger_journal_import_sequence on journal: time=373.344 calls=1
Trigger for constraint account2journal_account_id_fkey on account2journal: time=580.482 calls=17870
Trigger trigger_debits_equal_credits_on_insert on account2journal: time=116.653 calls=1
