From: Benjamin Johnson <benjamin(dot)johnson(at)getcarbonblack(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Guidance Requested - Bulk Inserting + Queries
Date: 2011-12-22 02:30:44
Message-ID: 4EF29654.9010303@getcarbonblack.com
Lists: pgsql-performance
Jeff,
Sorry for the delayed response. Please see (some) answers inline.
On 12/1/2011 9:06 AM, Jeff Janes wrote:
> On Wed, Nov 30, 2011 at 7:27 AM, Benjamin Johnson
> <benjamin(dot)johnson(at)getcarbonblack(dot)com> wrote:
>> Experts,
>>
>> Quick Summary: data can now be inserted very quickly via COPY + removing
>> indexes, but is there a design or some tricks to still allow someone to
>> query while the partition is still active and 'hot' ?
>>
>> - Postgres 9.1
>> - Windows 7 (64-bit) , although this is just for the current test and
>> could vary depending on situation
>> - We have 4 main tables with daily partitions
> How long are the daily partitions kept for?
We want this to be user-configurable but ideally 30 - 90 days, possibly
longer for (order of magnitude) smaller customers.
>> - Each table/partition has multiple indexes on it
>> - Streaming logs from client machines into our server app which
>> processes the logs and tries to shove all that data into these daily
>> partitions as fast as it can.
> Why shove it in as fast as you can? If you want to both read and
> write at the same time, then focusing first only on writing and
> worrying about reading as an afterthought seems like the wrong thing
> to do.
You're probably correct in that we need to think about the entire system
as a whole. We're concerned with getting the data from our host-based
clients to our server, where it is processed and stored. Because our
system is essentially a giant logging service for your enterprise, most
of the time we're collecting data and writing it. It will mainly be read
when some security incident occurs, but between those incidents we
expect it to be very write heavy. We're probably most concerned with
write performance because we were originally seeing poor times and were
worried about how well it would scale. We've improved it a lot, so we
might just need to take a step back and see what else we can do for the
overall system.
>> - Using COPY and removed original primary key unique constraints to try
>> to get it to be as fast as possible (some duplicates are possible)
>> - Will remove duplicates in a later step (disregard for this post)
>>
>> We now found (thanks Andres and Snow-Man in #postgresql) that in our
>> tests, after the indexes get too large performance drops significantly
>> and our system limps forward due to disk reads (presumably for the
>> indexes).
> How many hours worth of data can be loaded into the new partition
> before the performance knee hits?
In simulations, if I try to simulate the amount of data a large customer
would send, it is about an hour's worth of data before the indexes
get to be several gigabytes in size and performance really goes downhill
-- the "knee" if you will.
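For concreteness, the drop-indexes-then-COPY load path we're using looks
roughly like the following sketch. All table, column, and file names here
are invented for illustration, and the rebuild step is an assumption
(we'd only recreate indexes once a partition stops receiving writes):

```python
# Hypothetical sketch of the bulk-load path: drop the partition's
# indexes, COPY a batch in, and (once the partition goes cold) rebuild.
# Table/column/file names are made up for illustration.

def load_plan(partition, index_cols, csv_path):
    """Return the SQL statements for one bulk load into a partition."""
    stmts = [f"DROP INDEX IF EXISTS {partition}_{col}_idx;" for col in index_cols]
    stmts.append(f"COPY {partition} FROM '{csv_path}' WITH (FORMAT csv);")
    return stmts

def rebuild_plan(partition, index_cols):
    """Index rebuilds, deferred until the partition stops receiving writes."""
    return [f"CREATE INDEX {partition}_{col}_idx ON {partition} ({col});"
            for col in index_cols]

plan = load_plan("events_2011_12_21", ["event_id", "host_id"], "/tmp/batch.csv")
```

The point of deferring the CREATE INDEX statements is that the load
itself only pays for sequential heap writes, not random index I/O.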
> After the knee, how does the random disk read activity you see compare
> to the maximum random disk reads your IO system can support? How many
> COPYs were you doing at the same time?
I don't have exact statistics, but we had 4 writer threads all doing
COPY into 4 tables as fast as they received data.
The system is very much NOT ideal -- a Windows 7 developer-class
workstation with one 7200 RPM hard drive. I want to find bottlenecks in
this system and then see what real servers can handle. (We're a small
company and only now are starting to be able to invest in dev/test
servers.)
>
> During this test, was there background select activity going on, or
> was the system only used for COPY?
I pretty much stripped it entirely down to just doing the writes. Data
was coming in over HTTP to a python web stack, but that was pretty much
just passing these logfiles to the (C++) writer threads.
>> If we remove the indexes, performance for our entire sample
>> test is great and everything is written to postgresql very quickly.
>> This allows us to shove lots and lots of data in (for production
>> possibly 100 GB or a TB per day!)
> How much do you need to shove in per day? If you need to insert it,
> and index it, and run queries, and deal with maintenance of the older
> partitions, then you will need a lot of spare capacity, relative to
> just inserting, to do all of those things. Do you have windows where
> there is less insert activity in which other things can get done?
That's something we keep asking ourselves. Right now it's about 10 MB
per client per day. Some customers want 50,000 clients, which would be
500 GB per day if my math is correct. We know we will never handle this
with a single server, but we want to get as high as we can (say 5,000 -
10,000 clients) before saying that our customers have to add more
hardware.
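As a back-of-envelope sanity check on those numbers (decimal units
assumed):

```python
# Back-of-envelope check of the volumes above (decimal units assumed).
mb_per_client_per_day = 10
clients = 50_000

total_mb_per_day = mb_per_client_per_day * clients   # 500,000 MB
total_gb_per_day = total_mb_per_day / 1_000          # 500 GB/day
avg_mb_per_sec = total_mb_per_day / 86_400           # ~5.8 MB/s average

print(total_gb_per_day, round(avg_mb_per_sec, 1))
```

The sustained average sounds modest, but index maintenance amplifies
each logical write into random I/O, which is presumably where the knee
comes from.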
> Cheers,
>
> Jeff
We managed to work around the issue by having hourly tables inherit from
our daily tables. This makes our indexes smaller, and the writes in our
tests don't seem to hit this same limit (at least so far). I have a
couple of follow-up questions:
1) Would it be acceptable to have, let's say, 60 daily partitions, each
of which has 24 hourly partitions? Or would it be better, after a day or
two (once the data is old and mostly unchanged), to "roll up" the hourly
tables into their respective daily table and then remove the hourly
tables?
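To make question 1 concrete, the rollup step I have in mind would look
something like this sketch (9.1-era inheritance-style partitions; all
table names are invented):

```python
# Hypothetical rollup for question 1: once a day is cold, fold its 24
# hourly child tables back into the daily table inside one transaction.
# Table names are invented; assumes inheritance-style partitioning.

def rollup_sql(day, hours=24):
    """SQL to merge a finished day's hourly children into the daily table."""
    daily = f"events_{day}"
    stmts = ["BEGIN;"]
    for h in range(hours):
        hourly = f"{daily}_h{h:02d}"
        stmts.append(f"INSERT INTO {daily} SELECT * FROM {hourly};")
        stmts.append(f"DROP TABLE {hourly};")
    stmts.append("COMMIT;")
    return stmts

plan = rollup_sql("2011_12_21")
```

Doing the INSERT and DROP in one transaction keeps each hour's rows
visible exactly once to concurrent readers.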
2) Some of our indexes are on an identifier that is a hash of some event
attributes, so it's basically a random BIGINT. We believe part of the
problem is that each row could land in an entirely different location in
the index, causing lots of seeking and thrashing. Would making the index
a multi-column index on (event_timestamp, originally_indexed_column) be
better, so that rows are closer in proximity to other events coming in
around the same time? I have to admit that I don't really know how
indexes are stored / paged.
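Regarding the locality intuition in question 2, here is a toy model (not
PostgreSQL itself) that counts how many distinct B-tree leaf "pages" a
batch of inserts touches when keyed by a random hash versus by an
always-increasing, timestamp-led key. The page size and key counts are
arbitrary:

```python
# Toy B-tree locality model for question 2. We approximate leaf pages by
# slicing a sorted key space into fixed-size chunks, then count how many
# chunks a batch of new keys lands in. Numbers are purely illustrative.
import bisect
import random

random.seed(1)
EXISTING = sorted(random.getrandbits(63) for _ in range(1_000_000))
KEYS_PER_PAGE = 256

def pages_touched(new_keys):
    """Distinct leaf pages an insert batch would dirty in this model."""
    return len({bisect.bisect(EXISTING, k) // KEYS_PER_PAGE for k in new_keys})

batch_hash = [random.getrandbits(63) for _ in range(5_000)]   # random hash ids
batch_time = [EXISTING[-1] + i + 1 for i in range(5_000)]     # append-only keys

# Random hash ids scatter the batch across thousands of pages, while
# timestamp-led keys all land past the right edge of the tree.
```

The trade-off is that lookups by the hash id alone could no longer use a
(timestamp, hash) index efficiently; they'd need the timestamp too, or a
separate index.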
3) Does anyone else have similar systems where they have a ton of data
coming in that they also want to query? Any tips you can provide or
alternative designs? Once the data is in, it will 99.9% of the time
(100% of the time for some tables) be static. Part of the issue is that
the user wants to be able to search based on all sorts of attributes --
this leads to lots of indexes and more disk/memory usage when writing.
Ben
--
Benjamin Johnson
http://getcarbonblack.com/ | @getcarbonblack
cell: 312.933.3612