From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Varadharajan Mukundan <srinathsmn(at)gmail(dot)com>
Cc: vjoshi(at)zetainteractive(dot)com, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Performance issues
Date: 2015-03-13 22:25:50
Message-ID: CAOR=d=2L86fB+HWX8h078tYsd7LVKKFPQb=jfd8YNRU1Fc7mUA@mail.gmail.com
Lists: pgsql-performance
On Fri, Mar 13, 2015 at 4:03 PM, Varadharajan Mukundan
<srinathsmn(at)gmail(dot)com> wrote:
>> We might even consider taking experts advice on how to tune queries and
>> server, but if postgres is going to behave like this, I am not sure we would
>> be able to continue with it.
>>
>> Having said that, I would say again that I am completely new to this
>> territory, so I might miss lots and lots of things.
>
> My two cents: Postgres out of the box might not be a good choice for
> data warehouse style queries, because it is optimized to run
> thousands of small queries (OLTP-style processing) rather than one big
> monolithic query. I've faced similar problems myself before, and here
> are a few tricks I followed to get my elephant to do real-time ad hoc
> analysis on a table with ~45 columns and a few billion rows in it.
>
> 1. Partition your table! Use constraint exclusion to the fullest extent.
> 2. Fire multiple small queries distributed over the partitions and
> aggregate the results at the application layer. This is needed because
> you want to exploit all your cores to the fullest extent (assuming
> you've got enough memory for an effective FS cache). If your dataset goes
> beyond the capability of a single system, try something like Stado
> (GridSQL).
> 3. Storing indexes on a RAM disk / faster disk (using tablespaces) and
> using them properly makes the system blazing fast. CAUTION: this
> requires some extra infrastructure setup for backup and recovery.
> 4. If you're accessing a small set of columns in a big table and you
> feel compressing the data would help a lot, give this FDW a try -
> https://github.com/citusdata/cstore_fdw
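> The fan-out-and-aggregate idea in point 2 can be sketched as below.
> `query_partition` here is a hypothetical stand-in for running the
> per-partition SQL over its own connection (e.g. with psycopg2); each
> connection gets its own backend process, which is what lets the
> partition scans use multiple cores:

```python
# Sketch of point 2: fire one query per partition in parallel and
# aggregate the partial results at the application layer.
from concurrent.futures import ThreadPoolExecutor

def query_partition(partition):
    # Hypothetical stand-in: in real use, open a connection and run
    # e.g. "SELECT count(*) FROM <partition>" there, returning the
    # partial result. Canned numbers keep the sketch self-contained.
    fake_counts = {"events_2015_01": 100,
                   "events_2015_02": 250,
                   "events_2015_03": 175}
    return fake_counts[partition]

partitions = ["events_2015_01", "events_2015_02", "events_2015_03"]

# One worker per partition, so each backend can scan concurrently.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partials = list(pool.map(query_partition, partitions))

total = sum(partials)  # application-layer aggregation
print(total)           # 525
```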
Agreed here. If you're gonna run reporting queries against PostgreSQL
you have to optimize for fast seq scan stuff, i.e. an IO subsystem
that can read a big table at hundreds of megabytes per second.
Gigabytes per second if you can get it. A lot of spinning drives on a
fast RAID card or good software RAID can do this on the cheap, since a
lot of times you don't need big drives if you have a lot of them. 24
cheap 1TB drives that can each read at ~100 MB/s can gang up on the
data and read 100GB in under a minute. But you can't deny physics: if
you need to read a 2TB table, it's going to take time.
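As a back-of-the-envelope sketch (assuming the drives stream in
parallel with no RAID or filesystem overhead, which real arrays won't
quite reach):

```python
# Rough sequential-scan time for an array of spinning drives,
# assuming perfectly parallel streaming with no overhead.

def scan_seconds(table_gb, drives, mb_per_s_per_drive):
    """Seconds to sequentially read table_gb gigabytes."""
    aggregate_mb_per_s = drives * mb_per_s_per_drive
    return table_gb * 1024 / aggregate_mb_per_s

# 24 drives at ~100 MB/s each => ~2.4 GB/s aggregate.
print(round(scan_seconds(100, 24, 100)))   # 100GB table: ~43 seconds
print(round(scan_seconds(2048, 24, 100)))  # 2TB table: ~15 minutes
```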
If you're only running one or two queries at a time, you can crank
work_mem up to something crazy like 1GB, even on an 8GB machine.
Stopping sorts from spilling to disk, or at least giving queries a big
playground to work in, can make a huge difference. If you're gonna
hand out big work_mem then definitely limit connections to a handful.
If you need a lot of persistent connections, then use a pooler.
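A sketch of what that might look like in postgresql.conf (illustrative
numbers for a hypothetical dedicated ~8GB reporting box; remember that
work_mem is allocated per sort/hash operation, so a single complex
query can use several multiples of it):

```
# Dedicated reporting box, ~8GB RAM (sketch, not a recommendation)
max_connections = 8          # a handful of sessions; front with a pooler
work_mem = 1GB               # keep big sorts/hashes in memory
maintenance_work_mem = 512MB # faster index builds between report loads
effective_cache_size = 6GB   # tell the planner the FS cache is big
```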
The single biggest mistake people make in setting up reporting servers
on PostgreSQL is thinking that the same hardware that worked well for
transactional stuff (a handful of SSDs and lots of memory) will also
work well when you're working with TB data sets. The hardware you need
isn't the same, and reusing the transactional setup for a reporting
server is gonna result in sub-optimal performance.
--
To understand recursion, one must first understand recursion.