Re: Creating table with data from a join

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Igor Stassiy <istassiy(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Creating table with data from a join
Date: 2015-07-14 11:42:13
Message-ID: CAKJS1f96+B5TT2uEoGZJ9xyBaik1pcMay-CEOgcqAqbNKZLa2A@mail.gmail.com
Lists: pgsql-general

On 14 July 2015 at 21:12, Igor Stassiy <istassiy(at)gmail(dot)com> wrote:

> Hello,
>
> I am benchmarking different ways of putting data into a table at table
> creation:
>
> 1. INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id;
> 2. CREATE TABLE c AS SELECT * FROM a JOIN b on a.id = b.id;
> 3. psql -c "COPY (SELECT * FROM a JOIN b on a.id = b.id) TO STDOUT" |
> parallel --block 128M --jobs 4 --pipe psql -c "COPY c FROM STDIN";
>
> (the parallel command is available as part of the parallel deb package in
> Ubuntu, for example; it splits stdin on newlines and feeds the chunks to
> the given command)
>
> Both tables a and b have ~16M records and one of the columns in a is
> geometry (ranging from several KB in size to several MB). Columns in b are
> mostly integers.
>
> The machine that I am running these commands on has the following
> settings:
>
> default_statistics_target = 50 # pgtune wizard 2012-06-06
> maintenance_work_mem = 1GB # pgtune wizard 2012-06-06
> constraint_exclusion = on # pgtune wizard 2012-06-06
> checkpoint_completion_target = 0.9 # pgtune wizard 2012-06-06
> effective_cache_size = 48GB # pgtune wizard 2012-06-06
> work_mem = 80MB # pgtune wizard 2012-06-06
> wal_buffers = 8MB # pgtune wizard 2012-06-06
> checkpoint_segments = 16 # pgtune wizard 2012-06-06
> shared_buffers = 16GB # pgtune wizard 2012-06-06
> max_connections = 400 # pgtune wizard 2012-06-06
>
> One would expect the 3rd option to be faster than 1 and 2; however, 2
> outperforms both by a large margin (sometimes 2x). This is especially
> surprising considering that COPY doesn't acquire a global lock on the
> table, only a RowExclusiveLock
> (according to
> http://www.postgresql.org/message-id/10611.1014867684@sss.pgh.pa.us)
>
> So is option 2 a winner by design? Could you please suggest other
> alternatives to try (if there are any)? And what might be the reason that 3
> is not outperforming the other 2?
>

I would imagine that calling the output function to translate each value's
internal representation to its user-visible/external representation, plus
all the overhead of sending the results to the client, is a likely
candidate for the slowdown. In any case, 3 can only be as fast as the
query generating the output. With 1 and 2, the tuples of each record stay
in the internal format throughout.
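
If that conversion to text is the bottleneck, one variant that might be
worth measuring (a sketch only; I haven't benchmarked it) is COPY's binary
format, which uses each type's binary send/receive functions instead of
the text output functions:

psql -c "COPY (SELECT * FROM a JOIN b on a.id = b.id) TO STDOUT WITH (FORMAT binary)" | psql -c "COPY c FROM STDIN WITH (FORMAT binary)"

Note that the binary stream is not newline-delimited, so the parallel
--pipe trick from option 3 cannot be used to split it.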

If you have some logical way to break the query down into parts, then maybe
that would be a place to look.
For example:

INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id WHERE a.id < 8000000;
INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id WHERE a.id >= 8000000;

Of course, you'd need to be very careful to ensure that the results of the
SELECTs never overlap. It would be nice to have some better way than this,
one that divides the workload evenly even as the tables grow.
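
One candidate (assuming id is reasonably evenly distributed; untested
here) is to split on id modulo the number of jobs, which keeps the parts
disjoint and roughly equal no matter how large the tables get:

INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id WHERE a.id % 4 = 0;
INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id WHERE a.id % 4 = 1;
INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id WHERE a.id % 4 = 2;
INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id WHERE a.id % 4 = 3;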

Then you could run these concurrently.
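
For example, from the shell (mydb is a placeholder for your database name;
each psql session runs its INSERT in its own transaction):

psql mydb -c "INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id WHERE a.id % 4 = 0;" &
psql mydb -c "INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id WHERE a.id % 4 = 1;" &
psql mydb -c "INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id WHERE a.id % 4 = 2;" &
psql mydb -c "INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id WHERE a.id % 4 = 3;" &
wait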

Regards

David Rowley

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
