From: | "Michael L(dot) Artz" <dragon(at)october29(dot)net> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | Network Flow Schema + Bulk Import/Updates |
Date: | 2005-09-21 01:17:58 |
Message-ID: | 4330B4C6.8030108@october29.net |
Lists:      pgsql-general
I'm fairly new to Postgres and have some basic design questions. My
problem, in short: I want to do large bulk imports (millions of rows)
into a large DB (billions of rows) where some of the incoming data may
match existing rows that should be updated rather than inserted as new
rows. I read a similar post a few days ago, but I have a couple more
questions.
The DB essentially stores something similar to Cisco NetFlow records,
i.e. source IP, source port, dest IP, dest port, and a bit more
information about the flow, as well as 2 timestamps, a count
field, and a foreign key into a small table (<100 rows) (schema below).
I will be importing data nightly (potentially millions of rows) and,
since the DB is not quite production, I can do almost whatever I want,
as long as the DB is in a queryable state by morning. The nightly
imports are the only modifications to the table; otherwise the table is
read-only and accessed via a web application. Needless to say, the
table will get huge (I'm estimating billions of rows within a month),
although growth should level off as I accumulate more and more
duplicates.
CREATE TABLE flow (
    source_ip   INET,
    -- SMALLINT is not normally big enough for a port (signed vs.
    -- unsigned), but I convert in application space
    source_port SMALLINT,
    dest_ip     INET,
    dest_port   SMALLINT,
    comment     VARCHAR(128),
    count       INTEGER,
    first_seen  DATE,
    last_seen   DATE,
    fk          INTEGER NOT NULL REFERENCES small_table(small_id)
);
CREATE INDEX flow_source_ip_idx ON flow (source_ip);
CREATE INDEX flow_dest_ip_idx ON flow (dest_ip);
When I import the data (thousands to millions of rows), I want to check
whether a row already exists with the same source_ip, dest_ip, and
comment and, if so, update that row: increment its count and adjust the
first_seen and last_seen dates if need be. Otherwise, just insert a new
row with a count of 1.
Basic question: What's the best way to go about this?
From what I have read, the consensus seems to be to import (COPY) the
new data into a temporary table and then work the inserts and updates
from there. I also read some suggestions involving triggers ... which
way would be best given my dataset? I've also thought about doing it in
application space (Perl): query out all the rows that need to be
updated, delete those rows, drop the indexes, and then do a bulk COPY
of the new rows plus the modified old rows ... does that sound like a
good or bad idea?
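
To make the scratch-table idea concrete, here's roughly what I'm
picturing. The table and file names (flow_staging, flow_rollup,
/tmp/nightly.dat) are just placeholders, and LEAST/GREATEST assume
8.0 or later:

    BEGIN;

    CREATE TEMP TABLE flow_staging (
        source_ip   INET,
        source_port SMALLINT,
        dest_ip     INET,
        dest_port   SMALLINT,
        comment     VARCHAR(128),
        first_seen  DATE,
        last_seen   DATE,
        fk          INTEGER
    );

    COPY flow_staging FROM '/tmp/nightly.dat';

    -- Collapse duplicates within the load itself, so each
    -- (source_ip, dest_ip, comment) key appears only once; the
    -- ports and fk are picked arbitrarily here since the match
    -- key is just those three columns.
    CREATE TEMP TABLE flow_rollup AS
        SELECT source_ip, dest_ip, comment,
               MIN(source_port) AS source_port,
               MIN(dest_port)   AS dest_port,
               MIN(fk)          AS fk,
               COUNT(*)         AS hits,
               MIN(first_seen)  AS first_seen,
               MAX(last_seen)   AS last_seen
          FROM flow_staging
         GROUP BY source_ip, dest_ip, comment;

    -- Bump the rows that already exist ...
    UPDATE flow f
       SET count      = f.count + r.hits,
           first_seen = LEAST(f.first_seen, r.first_seen),
           last_seen  = GREATEST(f.last_seen, r.last_seen)
      FROM flow_rollup r
     WHERE f.source_ip = r.source_ip
       AND f.dest_ip   = r.dest_ip
       AND f.comment   = r.comment;

    -- ... then insert the genuinely new ones with their counts.
    INSERT INTO flow (source_ip, source_port, dest_ip, dest_port,
                      comment, count, first_seen, last_seen, fk)
    SELECT r.source_ip, r.source_port, r.dest_ip, r.dest_port,
           r.comment, r.hits, r.first_seen, r.last_seen, r.fk
      FROM flow_rollup r
     WHERE NOT EXISTS (SELECT 1
                         FROM flow f
                        WHERE f.source_ip = r.source_ip
                          AND f.dest_ip   = r.dest_ip
                          AND f.comment   = r.comment);

    COMMIT;

One wrinkle I can already see: if comment can be NULL, the equality
joins above would silently miss those rows, so I'd probably have to
make comment NOT NULL (empty string instead) or adjust the matching.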
Some other basic questions:
-Should I drop and recreate the indexes at any point during the load,
given that the updates and selects will use them extensively to find
matching rows in the existing flow table? Perhaps create a new index on
(source_ip, dest_ip)?
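For reference, something like this is what I have in mind (the index
names are arbitrary):

    CREATE INDEX flow_src_dst_idx ON flow (source_ip, dest_ip);
    -- or maybe even cover the whole match key:
    CREATE INDEX flow_match_idx ON flow (source_ip, dest_ip, comment);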
-What do you think of the schema for storing network flow data,
considering that I'm only storing single IPs and I do need the comment
field? Is the INET data type the best thing to use here (12 bytes, I
think), or should I use my application to convert my IPs to "INTEGER"
equivalents (4 bytes)? Perhaps create a new Postgres data type? Then
again, I have another table which actually uses INET networks, and I'd
like to be able to join the two, so a new data type might be more work
than I initially thought (although I'm up for it) :)
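Just to illustrate the 4-byte encoding I mean, and why the same
signed-vs-unsigned shuffle as with the ports would apply:

    -- 192.168.0.1 packed into a single number (the bigint cast is
    -- just so the arithmetic itself doesn't overflow):
    SELECT 192::bigint * 16777216 + 168 * 65536 + 0 * 256 + 1;
    -- => 3232235521, already past INTEGER's signed maximum of
    -- 2147483647, so it would need the same offset trick as the ports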
-How many rows at a time can I practically insert with the COPY
command? I've read about people doing millions, but is that realistic,
or should I break my load into X-record chunks/transactions?
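(By "chunks" I mean something like pre-splitting the load file and
wrapping each piece in its own transaction, since each COPY runs as a
single statement:

    BEGIN; COPY flow_staging FROM '/tmp/chunk_0001.dat'; COMMIT;
    BEGIN; COPY flow_staging FROM '/tmp/chunk_0002.dat'; COMMIT;
    -- ... and so on

where the chunk file names are again hypothetical.)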
Any suggestions/comments/flames would be appreciated.
Thanks
-Mike