From: "Carlo Stonebanks" <stonec(dot)register(at)sympatico(dot)ca>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Performace Optimization for Dummies
Date: 2006-09-28 21:04:31
Message-ID: efhdcr$2873$1@news.hub.org
Lists: pgsql-performance
Lots of great info here, I will see what applies to my situation. However, I
don't see bulk inserts into the tables working, because all of the tables need
to be refreshed as the values used to deduplicate and match change with every
row added. In order for this to work, I would have to write queries against
the hash tables. This is where something like MySQL's in-memory tables would
have come in handy...
What is GDB?
Carlo
"Matthew Nuzum" <matthew(dot)nuzum(at)canonical(dot)com> wrote in message
news:451C18BE(dot)8050908(at)canonical(dot)com(dot)(dot)(dot)
> Carlo Stonebanks wrote:
>>> are you using the 'copy' interface?
>>
>> Straightforward inserts - the import data has to be transformed, normalised
>> and de-duped by the import program. I imagine the copy interface is for more
>> straightforward data importing. These are - by necessity - single-row
>> inserts.
>>
>
> I know this is an answer to a question you didn't ask, but here it is. I
> was once doing stuff where I processed log files and had to do many
> lookups to normalize the data before insertion.
>
> I started out doing everything in SQL using PostgreSQL tables, and it
> took a little over 24 hours to process 24 hours' worth of data. Like you,
> it was a single process with many lookups.
>
> I found a better way. I rewrote it (once in C#, again in Python) and
> used hashtables/dictionaries instead of tables for the lookup data. For
> example, I'd start by loading the data into hash tables (yes, this took
> a *lot* of RAM), then for each row I did something like:
> 1. Is it in the hash table?
>    a. If not, insert it into the db
>    b. Insert it into the hashtable
> 2. Get the lookup field out of the hash table
> 3. Output normalized data
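
A rough sketch in Python of the lookup step described in the quoted list
above; psycopg2 and the "vendors" table with its vendor_id/name columns are
illustrative assumptions, not details from the original post.

    # Keep the lookup table in a dict, insert unseen values into the
    # database once, and reuse the cached id for every later row.
    # Table and column names are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=imports")
    cur = conn.cursor()

    # Load the existing lookup data into a hash table (dict) up front.
    cur.execute("SELECT name, vendor_id FROM vendors")
    vendor_ids = dict(cur.fetchall())
    next_id = max(vendor_ids.values(), default=0) + 1

    def lookup_vendor(name):
        """Steps 1-2 above: check the dict, add to db and dict if new,
        and return the id for the normalized output row."""
        global next_id
        if name not in vendor_ids:
            cur.execute(
                "INSERT INTO vendors (vendor_id, name) VALUES (%s, %s)",
                (next_id, name))
            vendor_ids[name] = next_id
            next_id += 1
        return vendor_ids[name]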
>
> This allowed me to create text files containing the data in COPY format,
> which could then be inserted into the database at dramatically increased
> speed.
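
The COPY-format output might look roughly like this; the file name, the
read_source_rows() iterator and the column layout are made-up stand-ins for
whatever the real import program uses.

    # Write the normalized rows as tab-separated text (COPY's default
    # format) instead of issuing one INSERT per row, then bulk-load the
    # file afterwards.
    with open("orders.copy", "w") as out:
        for row in read_source_rows():            # hypothetical source reader
            vendor_id = lookup_vendor(row["vendor_name"])
            out.write("%s\t%s\t%s\n"
                      % (row["order_no"], vendor_id, row["amount"]))

    # Load it in one shot, e.g. with psycopg2:
    #   with open("orders.copy") as f:
    #       cur.copy_from(f, "orders",
    #                     columns=("order_no", "vendor_id", "amount"))
    # or from psql:
    #   \copy orders (order_no, vendor_id, amount) FROM 'orders.copy'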
>
> My first version in C# (Mono) cut the time down to 6 hours for 24 hours'
> worth of data. I tweaked the algorithms and rewrote it in Python and got
> it down to 45 min. (Python can't take all the credit for the performance
> boost; I used an improved technique that could have been done in C# as
> well.) This time included the time needed to do the COPY and update the
> indexes.
>
> I created a version that also used gdb databases instead of hash tables.
> It increased the time from 45 min to a little over an hour, but decreased
> the memory usage to something like 45 MB (vs. dozens or hundreds of MB per
> hashtable).
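
If "gdb databases" here means GNU dbm (gdbm) files (my reading; the post
doesn't spell it out), the on-disk variant of the same lookup could be
sketched like this, with new_vendor_id() standing in for the code that
actually inserts the row:

    # Swap the in-memory dict for an on-disk gdbm key/value file, trading
    # some speed for a much smaller memory footprint.  dbm.gnu is the
    # Python 3 name for the old gdbm module; values come back as bytes,
    # hence the conversions.
    import dbm.gnu

    db = dbm.gnu.open("vendors.gdbm", "c")        # "c" = create if missing

    def lookup_vendor_gdbm(name):
        if name not in db:
            db[name] = str(new_vendor_id(name))   # hypothetical helper
        return int(db[name])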
> --
> Matthew Nuzum
> newz2000 on freenode
>