Re: [GENERAL] 7.4Beta

From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: "Andreas Pflug" <pgadmin(at)pse-consulting(dot)de>
Cc: "Stephan Szabo" <sszabo(at)megazone(dot)bigpanda(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] 7.4Beta
Date: 2003-08-15 18:53:30
Message-ID: D90A5A6C612A39408103E6ECDD77B829408B8C@voyager.corporate.connx.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> -----Original Message-----
> From: Andreas Pflug [mailto:pgadmin(at)pse-consulting(dot)de]
> Sent: Friday, August 15, 2003 11:36 AM
> To: Dann Corbit
> Cc: Stephan Szabo; PostgreSQL-development
> Subject: Re: [HACKERS] [GENERAL] 7.4Beta
>
> Dann Corbit wrote:
>
> >Simplification of bulk operations can be very important for
> customers
> >(on the other hand). For the CONNX tool set, we offer an escape on
> >INSERT/SELECT that performs the operation in bulk mode.
> >
> >There are serious downsides to bulk operations also (such as
> not being
> >logged and therefore existing outside of a transaction). Therefore,
> >they are useful really only in cases where a complete
> destruction and
> >repopulation is called for. If anything goes haywire, you
> can't simply
> >roll it back.
> >
> >Yet the speed savings can be enormous (orders of magnitude).
> >
> >Compared to iteration over a set of prepared inserts, a bulk insert
> >(such as using Microsoft's BCP API or Oracles Direct Path
> loading) can
> >be 100 times faster. If you are moving gigabytes of data and
> >performing a complete refresh, the method to use becomes obvious.
> >
> >When we go outside of the language bounds, a curly braced escape
> >notation is used. For instance, an insert/select might look
> like this:
> >INSERT INTO <destination_table> SELECT <column_list> FROM
> ><source_table> {fn commitcount 1000} {bulkmode} The commit
> count says
> >to use batches of 1000 rows and bulkmode says to use the fastest
> >possible insert method.
> >
> >Imagine (for instance) that PostgreSQL can use an ODBC {or similar}
> >data source as a table. Then, with a syntax such as the above (or
> >SELECT INTO etc.), you could very rapidly move data from one system
> >into another.
> >
> >
> When saying "bulk operation" I don't necessarily mean using
> bulk load or
> stuff like that. What I mean is handling large amounts of
> similar data
> at the same time. That doesn't say anything about transactions or
> logging problems.
> Imagine you have 100k or rows to load, each having FKs to (hopefully)
> existing rows
> - Step 1: load the rows into a temp table
> - Step 2: identify duplicates, logging and deleting them
> - Step 3: insert all data satisfying the FK constraints.
> - Step 4: log all that didn't insert.
>
> This are relatively few statements (not the simplest), which can be
> handled in a safe manner.
> Just an example, how a RDBMS can do the job in a mass
> oriented (and thus
> optimizable) way.

I really, really like SQL*Server's solution to the problem you have
stated above. A typical example is you have a pile of incoming words
and you want to create a dictionary. An example might be a library
database, where you have an abstract for each item. You parse the words
and look for new ones to add.

What SQL*Server allows is an index with an option called "IGNORE
DUPLICATES" that simply tosses out rows that are already in the table.
For applications like what I have described and what you have described
it is an incredibly useful extension. Once I got used to it, I found
myself using it all the time. Of course, you must be very careful to
ensure that the duplicates really are completely unimportant.

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andreas Pflug 2003-08-15 19:13:49 Re: [GENERAL] 7.4Beta
Previous Message Tom Lane 2003-08-15 18:49:41 Re: [GENERAL] 7.4Beta