Re: Creating table with data from a join

From: Julien Rouhaud <julien(dot)rouhaud(at)dalibo(dot)com>
To: Igor Stassiy <istassiy(at)gmail(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Creating table with data from a join
Date: 2015-07-14 22:39:52
Message-ID: 55A58FB8.2000304@dalibo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 14/07/2015 18:50, Igor Stassiy wrote:
> Julien, I would gladly provide more information, I am just not sure what
> to add.
>

Well, was your concern about why option #2 is the quickest, or is this
runtime with option #2 still too slow for you ?

> I would be willing to leave the server compromised for things like
> corrupts or data losses during the time of this import, but the server
> has to be up and running before and after the import, if it is
> successful (so I can't take it down then change some parameters and
> start it up with again).

Check http://www.postgresql.org/docs/current/static/sql-createtable.html
and the "UNLOGGED" part to check if an unlogged table is suitable for you.

>
> On Tue, Jul 14, 2015 at 6:37 PM Julien Rouhaud
> <julien(dot)rouhaud(at)dalibo(dot)com <mailto:julien(dot)rouhaud(at)dalibo(dot)com>> wrote:
>
> On 14/07/2015 18:21, Igor Stassiy wrote:
> > Julien, I have the following setting for WAL level: #wal_level =
> minimal
> > (which defaults to minimal anyway)
> >
>
> Sorry, I sent my mail too early :/
>
> So, option #2 is winner by design. You didn't say anything about your
> needs, so it's hard to help you much more.
>
> If you don't care about losing data on this table if your server
> crashes, you can try option #3 with an unlogged table.
>
>
> > On Tue, Jul 14, 2015 at 6:19 PM Julien Rouhaud
> > <julien(dot)rouhaud(at)dalibo(dot)com <mailto:julien(dot)rouhaud(at)dalibo(dot)com>
> <mailto:julien(dot)rouhaud(at)dalibo(dot)com
> <mailto:julien(dot)rouhaud(at)dalibo(dot)com>>> wrote:
> >
> > On 14/07/2015 11:12, Igor Stassiy wrote:
> > > Hello,
> > >
> > > I am benchmarking different ways of putting data into table
> on table
> > > creation:
> > >
> > > 1. INSERT INTO c SELECT * FROM a JOIN b on a.id
> <http://a.id> <http://a.id>
> > <http://a.id> = b.id <http://b.id> <http://b.id>
> > > <http://b.id>;
> > > 2. CREATE TABLE c AS SELECT * FROM a JOIN b on a.id
> <http://a.id> <http://a.id>
> > <http://a.id> = b.id <http://b.id> <http://b.id>
> > > <http://b.id>;
> > > 3. psql -c "COPY (SELECT * FROM a JOIN b on a.id
> <http://a.id> <http://a.id>
> > <http://a.id> = b.id <http://b.id> <http://b.id>
> > > <http://b.id>) TO STDOUT" |
> > > parallel --block 128M --jobs 4 --pipe psql -c "COPY c FROM
> STDIN";
> > >
> > > (the parallel command is available as part of parallel deb
> package in
> > > Ubuntu for example, it splits the stdin by newline character
> and feeds
> > > it to the corresponding command)
> > >
> > > Both tables a and b have ~16M records and one of the columns
> in a is
> > > geometry (ranging from several KB in size to several MB).
> Columns in b
> > > are mostly integers.
> > >
> > > The machine that I am running these commands on has the
> following
> > > parameters:
> > >
> > > default_statistics_target = 50 # pgtune wizard 2012-06-06
> > > maintenance_work_mem = 1GB # pgtune wizard 2012-06-06
> > > constraint_exclusion = on # pgtune wizard 2012-06-06
> > > checkpoint_completion_target = 0.9 # pgtune wizard 2012-06-06
> > > effective_cache_size = 48GB # pgtune wizard 2012-06-06
> > > work_mem = 80MB # pgtune wizard 2012-06-06
> > > wal_buffers = 8MB # pgtune wizard 2012-06-06
> > > checkpoint_segments = 16 # pgtune wizard 2012-06-06
> > > shared_buffers = 16GB # pgtune wizard 2012-06-06
> > > max_connections = 400 # pgtune wizard 2012-06-06
> > >
> > > One would expect the 3rd option to be faster than 1 and 2,
> however 2
> > > outperforms both by a large margin (sometimes x2). This is
> especially
> > > surprising taking into account that COPY doesn't acquire a
> global lock
> > > on the table, only a RowExclusiveLock
> > > (according
> > > to
> >
> http://www.postgresql.org/message-id/10611.1014867684@sss.pgh.pa.us)
> > >
> >
> > What is wal_level value? I think this is because of an
> optimisation
> > happening with wal_level = minimal:
> >
> > "In minimal level, WAL-logging of some bulk operations can be
> safely
> > skipped, which can make those operations much faster"
> >
> > see
> >
> http://www.postgresql.org/docs/current/static/runtime-config-wal.html
> >
> > > So is option 2 a winner by design? Could you please suggest
> other
> > > alternatives to try (if there are any)? And what might be
> the reason
> > > that 3 is not outperforming the other 2?
> > >
> > > Thank you,
> > > Igor
> > >
> > >
> >
> >
> > --
> > Julien Rouhaud
> > http://dalibo.com - http://dalibo.org
> >
>
>
> --
> Julien Rouhaud
> http://dalibo.com - http://dalibo.org
>

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Mitu Verma 2015-07-15 05:44:16 Re: unexpected data beyond EOF in block 260 of relation pg_tblspc
Previous Message Kevin Grittner 2015-07-14 21:12:31 Re: Disconnected but query still running