Re: Insert performance for large transaction with multiple COPY FROM

From: Horst Dehmer <horst(dot)dehmer(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Insert performance for large transaction with multiple COPY FROM
Date: 2013-01-18 23:15:15
Message-ID: 273CF8C2-159B-44FF-B382-61A76541F435@gmail.com
Lists: pgsql-performance

Hey Jeff (and others)!

First of all: Thanks for your detailed explanations and guidelines.

On 17.01.2013, at 18:12, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> So the theory is that the presence of idx_4 is causing the trigger to
> pick a poor plan (i.e. one using idx_4) while its absence removes that
> temptation?

Yes. And auto_explain confirms this for the first record (obj_item_loc_obj_item_id_idx = idx_4 from my last mail):

2013-01-18 22:50:21 CET LOG: duration: 0.021 ms plan:
Query Text: SELECT * FROM obj_item_loc WHERE obj_item_loc.obj_item_id = NEW.obj_item_id AND obj_item_loc.loc_id = NEW.loc_id AND obj_item_loc.obj_item_loc_ix = NEW.obj_item_loc_ix
Index Scan using obj_item_loc_obj_item_id_idx on obj_item_loc (cost=0.00..8.27 rows=1 width=382)
Index Cond: (obj_item_id = $15)
Filter: ((loc_id = $16) AND (obj_item_loc_ix = $17))
2013-01-18 22:50:21 CET CONTEXT: SQL statement "SELECT * FROM obj_item_loc WHERE obj_item_loc.obj_item_id = NEW.obj_item_id AND obj_item_loc.loc_id = NEW.loc_id AND obj_item_loc.obj_item_loc_ix = NEW.obj_item_loc_ix"
PL/pgSQL function obj_item_loc_before_insert() line 5 at SQL statement
COPY obj_item_loc, line 1: "10903011224100007276 10903010224100015110 10900024100000029720 \N \N \N \N \N \N \N \N \N \N \N \N \..."

and for one of the last records:

2013-01-18 22:53:20 CET LOG: duration: 16.088 ms plan:
Query Text: SELECT * FROM obj_item_loc WHERE obj_item_loc.obj_item_id = NEW.obj_item_id AND obj_item_loc.loc_id = NEW.loc_id AND obj_item_loc.obj_item_loc_ix = NEW.obj_item_loc_ix
Index Scan using obj_item_loc_obj_item_id_idx on obj_item_loc (cost=0.00..8.27 rows=1 width=382)
Index Cond: (obj_item_id = $15)
Filter: ((loc_id = $16) AND (obj_item_loc_ix = $17))
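
For readers following along: the trigger in question looks essentially like the sketch below, reconstructed from the logged statement. Only the SELECT is visible in the log; the declaration and the rest of the body are my assumptions.

CREATE OR REPLACE FUNCTION obj_item_loc_before_insert() RETURNS trigger AS $$
DECLARE
    existing obj_item_loc%ROWTYPE;
BEGIN
    -- the statement from the log; this is where the poor plan gets chosen
    SELECT * INTO existing
      FROM obj_item_loc
     WHERE obj_item_loc.obj_item_id = NEW.obj_item_id
       AND obj_item_loc.loc_id = NEW.loc_id
       AND obj_item_loc.obj_item_loc_ix = NEW.obj_item_loc_ix;
    -- whatever the trigger does with the row is not shown in the log
    -- and is omitted here
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;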

I see a linear increase in duration from 0.0x ms to over 16 ms (apart from a few nasty outliers at about 22 ms). Even at the end there are still a few durations below 0.03 ms, but most are 15 ms and above.
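
One workaround I might try (a sketch, not something from this thread): interleave an ANALYZE with the load, so the planner stops choosing plans based on statistics gathered while the table was empty. Since ANALYZE invalidates cached plans that depend on the table, the trigger's SELECT should be re-planned afterwards. The file paths below are placeholders.

COPY obj_item_loc FROM '/path/to/batch1.dat';  -- placeholder path
ANALYZE obj_item_loc;  -- refresh statistics; invalidates the cached trigger plan
COPY obj_item_loc FROM '/path/to/batch2.dat';  -- placeholder path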

> True disk reads are much more expensive, but given how few reads you
> have relative to hits, I now think that in aggregate the hits are more
> of a concern than the reads are. In other words, you seem to be CPU
> bound, not IO bound.

Yes, definitely CPU bound, as top shows 99+% CPU utilization.
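
(For reference, the hit/read counters we have been comparing can be pulled per table from the statistics collector; this is just the standard view, nothing specific to this setup:)

SELECT heap_blks_read, heap_blks_hit,
       idx_blks_read,  idx_blks_hit
  FROM pg_statio_user_tables
 WHERE relname = 'obj_item_loc';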

> Even more so I think (but not with much confidence) that most of your
> "reads" are actually coming from the OS cache and not from the disk.
> PG cannot distinguish true disk reads from OS cache reads.
>
> When was the last time you reset the stats? That is, are your
> reported numbers accumulated over several loads, with some having idx4
> and some not?

I set up a fresh database before each test run. So the stats should be clean.
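
(A fresh database is the safest way; for completeness, the counters for the current database can also be cleared in place:)

SELECT pg_stat_reset();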

> More formally, use auto_explain and set
> auto_explain.log_nested_statements to true. I haven't verified this
> works with triggers, just going by the description I think it should.

Nice tip! Works for triggers as well.
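
For anyone wanting to reproduce this, the session setup for the plans shown above would look roughly like the following; the zero threshold is an assumption on my part.

LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;          -- log every plan (assumed threshold)
SET auto_explain.log_nested_statements = true;  -- include statements run inside triggers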

> Your use case is a little unusual. If you are bulk loading into an
> initially empty table, usually you would remove the trigger and add it
> after the load (with some kind of bulk operation to make up for
> whatever it was the trigger would have been doing). On the other
> hand, if you are bulk loading into a "live" table and so can't drop
> the trigger, then the live table should have good-enough preexisting
> statistics to make the trigger choose a good plan.

My case is indeed unusual: across the whole model of 276 tables there will never be an update or a delete on any row.
The model is rather short-lived, from a few hours to a few months. COPY FROM/TO are the only ways to get data into the database and back out. In between there is lots of graph traversal and calculation of convex hulls. But the lengthy transactions are by far not the common case.
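
For loads where the trigger could really be sidestepped, Jeff's pattern would look roughly like this (a sketch; the trigger name is hypothetical, and the bulk step depends on what the trigger actually does):

ALTER TABLE obj_item_loc DISABLE TRIGGER obj_item_loc_before_insert_trg;  -- hypothetical name
COPY obj_item_loc FROM stdin;
-- ... one set-based statement replicating the per-row trigger work ...
ALTER TABLE obj_item_loc ENABLE TRIGGER obj_item_loc_before_insert_trg;
ANALYZE obj_item_loc;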

Having said that, I'm no longer sure an RDBMS is the right tool for this backend. Maybe indexing and storing the data with a plain full-text search engine would be a better fit. Dunno...

Thanks again!

--
Horst
