From: | "Kevin Bartz" <kbartz(at)loyaltymatrix(dot)com> |
---|---|
To: | <pgsql-general(at)postgresql(dot)org> |
Subject: | FW: Out of swap space & memory |
Date: | 2004-08-09 17:37:59 |
Message-ID: | 20040809174353.63D4540010@omta18.mta.everyone.net |
Lists: pgsql-general
-----Original Message-----
From: Kevin Bartz [mailto:kbartz(at)loyaltymatrix(dot)com]
Sent: Monday, August 09, 2004 10:37 AM
To: 'mike(at)thegodshalls(dot)com'
Subject: RE: [GENERAL] Out of swap space & memory
Thanks for your reply, Mike! Theoretically, I should need only six of the
columns, but as a means of verifying integrity, I would like to de-dup using
all the columns. For instance, if there are two rows identical everywhere
but some column outside the six, I would like to know about it so I can
report back to the data provider. Maybe there's some other way to do this
kind of check?
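
One idea (just a sketch; key1 through key6 are placeholders for the six
columns I actually need) would be to look for key combinations that map to
more than one distinct full row:

select key1, key2, key3, key4, key5, key6
from (select distinct * from octanenights_raw) t
group by key1, key2, key3, key4, key5, key6
having count(*) > 1;

Would that be a sensible way to run the check?
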
Kevin
-----Original Message-----
From: mike g [mailto:mike(at)thegodshalls(dot)com]
Sent: Friday, August 06, 2004 9:07 PM
To: Kevin Bartz
Subject: RE: [GENERAL] Out of swap space & memory
Hmmmm, can you determine which rows are duplicates by examining one
column, or must you examine all the columns?
On Fri, 2004-08-06 at 22:56, Kevin Bartz wrote:
> Thanks for your reply, but how can I then solve the problem of duplicates?
> Using your example, if one duplicate lives between rows 1 and 1,000,000 and
> another between rows 1,000,001 and 2,000,000, de-duping them individually
> will result in the duplicate showing up twice.
>
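> A final de-dup pass over the UNION ALLed result would catch that, but that
> pass is nearly as big as the original problem. As a sketch, with deduped1
> and deduped2 as placeholder names for the two de-duped halves:
>
> select distinct * into octanenights
> from (select * from deduped1 union all select * from deduped2) t;
>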
> Kevin
>
> -----Original Message-----
> From: mike g [mailto:mike(at)thegodshalls(dot)com]
> Sent: Friday, August 06, 2004 8:52 PM
> To: Kevin Bartz
> Subject: RE: [GENERAL] Out of swap space & memory
>
> Can octanenights_raw be altered? If so, using the sequence feature in
> Postgres you could auto-populate a new column with a number starting at 1 on
> up to 36 million. Then your select queries could be select X from
> octanenights_raw where sequence_column_name >= 1 and sequence_column_name
> <= 1000000, etc., roughly as sketched below.
>
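> A rough sketch (untested; seq_id and octanenights_seq are placeholder names,
> and the column has to be added and then populated in separate steps):
>
> create sequence octanenights_seq;
> alter table octanenights_raw add column seq_id int;
> update octanenights_raw set seq_id = nextval('octanenights_seq');
>
> select * from octanenights_raw where seq_id between 1 and 1000000;
>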
> Mike
>
> On Fri, 2004-08-06 at 22:40, Kevin Bartz wrote:
> > Well, all I'm doing right now is using psql from the command line. A bit
> > unintelligent, I know, but I want to make sure things will work
> > appropriately before I dive into the world of query editors.
> >
> > Thanks for your suggestions. My data won't need to be updated regularly. In
> > this case, as in all others, loading each set of data is a one-shot process,
> > so I don't think I'll need to worry about truncating.
> >
> > I can't think of any way to break the problem into steps, since the
> > duplicates may well be scattered throughout the table. If I split
> > octanenights into octanenights1 and octanenights2 and then de-dup each
> > individually, I would have to first be sure that octanenights1 does not
> > share a duplicate with octanenights2, or that duplicate would appear in the
> > UNION ALLed version. Maybe I'm missing something?
> >
> > Thanks for your kind response.
> >
> > Kevin
> >
> > -----Original Message-----
> > From: mike g [mailto:mike(at)thegodshalls(dot)com]
> > Sent: Friday, August 06, 2004 8:19 PM
> > To: Kevin Bartz
> > Cc: 'Manfred Koizar'; pgsql-general(at)postgresql(dot)org
> > Subject: Re: [GENERAL] Out of swap space & memory
> >
> > Ok,
> >
> > This is a long shot, but how are you executing your code? In, say, a
> > pgAdmin III SQL window, with the statements below entered line after line?
> >
> > If so, I believe it will be treated as one transaction. With the default
> > settings, Postgres would have to keep track of everything done in order to
> > roll back all the changes if anything failed. I would guess that forces it
> > to keep track of all 56 million rows combined in memory (probably just the
> > oid column - I am sure the more experienced PostgreSQL wizards can verify),
> > but even so that can take a lot of resources.
> >
> > If by chance you are doing it in one sweep, try executing it in separate
> > steps so that each commit can go through.
> >
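> > For instance (just a sketch, reusing your file paths):
> >
> > begin;
> > copy octanenights from '/home/kevin/octanenights/proc/uberfile/uberfile1.txt' null as '';
> > commit;
> >
> > begin;
> > copy octanenights from '/home/kevin/octanenights/proc/uberfile/uberfile2.txt' null as '';
> > commit;
> >
> > -- ...and so on, with the final de-dup select as its own transaction.
> >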
> > Hopefully you won't run out of resources then.
> >
> > Are you doing a drop / create, say every night, to update your data? If so,
> > perhaps using TRUNCATE octanenights might be more efficient.
> >
> > If you must drop a full table, perhaps a vacuum should be done afterwards?
> >
> > Mike
> >
> > On Fri, 2004-08-06 at 21:32, Kevin Bartz wrote:
> > > Mike, thanks so much for your reply. I'm sorry for not showing you my SQL.
> > > I didn't show it because I couldn't manage to boil it down to something
> > > reproducible that everyone could try. But here's what it was:
> > >
> > > drop table octanenights;
> > > CREATE TABLE octanenights (member_id varchar(100), campaign_id varchar(100),
> > >     catalog_type varchar(100), pushed int, delivered int, clicks int,
> > >     opened int, month varchar(100), type1 int, type2 int, type3 int,
> > >     type4 int, type5 int);
> > >
> > > copy octanenights from '/home/kevin/octanenights/proc/uberfile/uberfile1.txt' null as '';
> > > copy octanenights from '/home/kevin/octanenights/proc/uberfile/uberfile2.txt' null as '';
> > > copy octanenights from '/home/kevin/octanenights/proc/uberfile/uberfile3.txt' null as '';
> > >
> > > select * from octanenights limit 10;
> > > alter table octanenights rename to octanenights_raw;
> > >
> > > -- de-dup the table
> > > select member_id, campaign_id, catalog_type, pushed, delivered, clicks,
> > >     opened, month, type1, type2, type3, type4, type5
> > > into octanenights
> > > from octanenights_raw
> > > group by member_id, campaign_id, catalog_type, pushed, delivered, clicks,
> > >     opened, month, type1, type2, type3, type4, type5;
> > >
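> > > (That group by, with no aggregates, is just my way of writing select
> > > distinct; as far as I know the following sketch should be equivalent:)
> > >
> > > select distinct member_id, campaign_id, catalog_type, pushed, delivered,
> > >     clicks, opened, month, type1, type2, type3, type4, type5
> > > into octanenights
> > > from octanenights_raw;
> > >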
> > > Let me tell you a little about octanenights. It's a file of about
> > > 36,000,000 rows, each describing an e-mail sent. Unfortunately, there are
> > > duplicate records scattered throughout the table, which I do not care
> > > about. One might suggest that I could've used uniq from the command line
> > > for this, but the data were not sorted originally and the duplicate
> > > records may be scattered anywhere in the table. The objective in the
> > > final line is to de-dup the table and place it into octanenights, leaving
> > > the original in octanenights_raw in case I ever need to refer back to it.
> > >
> > > MS SQL Server, with as much RAM and less clock speed, de-dups the table
> > > in about six minutes. The de-duped version has about 26,000,000 rows. The
> > > final line is where Postgres gobbles up all my swap and RAM and then
> > > conks out completely.
> > >
> > > Am I doing something wrong? Maybe there was a better way to approach this
> > > problem? I'd be open to suggestions of any kind, since I'm still very,
> > > very new to the world of optimizing Postgres.
> > >
> > > Kevin
> > >
> > > -----Original Message-----
> > > From: pgsql-general-owner(at)postgresql(dot)org
> > > [mailto:pgsql-general-owner(at)postgresql(dot)org] On Behalf Of Manfred Koizar
> > > Sent: Tuesday, August 03, 2004 3:04 AM
> > > To: Kevin Bartz
> > > Cc: pgsql-general(at)postgresql(dot)org
> > > Subject: Re: [GENERAL] Out of swap space & memory
> > >
> > > On Mon, 2 Aug 2004 20:08:45 -0700, "Kevin Bartz"
> > > <kbartz(at)loyaltymatrix(dot)com> wrote:
> > > >is there any way I can run this query?
> > >
> > > What query? You didn't show us your SQL.
> > >
> > > Servus
> > > Manfred