Help with optimizing a query over hierarchical data

From: Damon Snyder <damon(at)huddler-inc(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Help with optimizing a query over hierarchical data
Date: 2014-02-28 20:01:52
Message-ID: CACkQbuiU7aaWxHsOuwVZgibg=Kjj5+tyM3h3-10ntdBK7SSYpw@mail.gmail.com
Lists: pgsql-performance

Hi Everyone,
We have a data set and access pattern that is causing us some performance
issues. The data set is hierarchical with about 2 million rows at the
lowest level (object), followed by 500k at the next level (container) and
approximately 10 at the highest level (category).
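Roughly, the structure looks something like this (the table and column names here are simplified for illustration, not the exact production schema):

```sql
-- Sketch of the three-level hierarchy (illustrative names only).
CREATE TABLE category (
    id   serial PRIMARY KEY,
    name text NOT NULL
);                                          -- ~10 rows

CREATE TABLE container (
    id          serial PRIMARY KEY,
    category_id integer NOT NULL REFERENCES category (id),
    name        text NOT NULL
);                                          -- ~500k rows

CREATE TABLE object (
    id           serial PRIMARY KEY,        -- also proxies chronological order
    container_id integer NOT NULL
                 REFERENCES container (id), -- the primary container
    score        integer NOT NULL DEFAULT 0
);                                          -- ~2M rows
```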

The way the data is structured, objects live in one primary container but
can also reside in one or more secondary containers. The secondary
containers have to be filtered by category ids for access control. This
doesn't really pose a problem on the primary containers, because they are
homogeneous by category, but it slows things down on the secondary
containers.
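Since the secondary placements are a many-to-many relationship, the category filter ends up being applied through a join, something like this (again, names are illustrative):

```sql
-- Objects can also appear in secondary containers (many-to-many).
CREATE TABLE object_container (
    object_id    integer NOT NULL REFERENCES object (id),
    container_id integer NOT NULL REFERENCES container (id),
    PRIMARY KEY (object_id, container_id)
);

-- Access control on a secondary container means restricting the rows
-- to the viewer's allowed category ids:
SELECT o.*
  FROM object o
  JOIN object_container oc ON oc.object_id = o.id
  JOIN container c         ON c.id = oc.container_id
 WHERE oc.container_id = $1
   AND c.category_id IN (1, 2, 3);  -- categories visible to this viewer
```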

The access pattern certainly complicates things. We need to order the data
by a value (chronological, score, etc) and jump to an arbitrary position
and window within the ordering. I have provided an example of the
chronological and score based ordering and indexing in the script provided
below. I'm proxying the chronological ordering by using the
sequence-generated id for the sample code; in the application we use a
timestamp.
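In query form, "jump to an arbitrary position and window" is essentially an ordered scan with an offset, along these lines (a sketch, not the exact production query):

```sql
-- Chronological ordering (id as a proxy for a timestamp), jumping to
-- an arbitrary window in the ordering: rows 5001-5025.
SELECT o.id, o.score
  FROM object o
 WHERE o.container_id = $1
 ORDER BY o.id DESC
OFFSET 5000
 LIMIT 25;

-- The score-based variant is the same shape with ORDER BY o.score DESC,
-- which is where most of the extra cost shows up for us.
```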

I have created sample code (https://gist.github.com/drsnyder/9277054) that
will build a dataset that is similar to the one that we have in production.
The distributions aren't exactly the same, but they should reproduce the
behavior. I have also provided examples of the queries that give us
problems.

The code is documented inline and points out the queries that are causing
problems. You should be able to run the script on a 9.2.2 (we use 9.2.6)
database in about 10 minutes on a development laptop (or 1-2 minutes on
production-like hardware) to experiment with the data and the SQL. The
total database size is about 570MB.

The primary query that I'm trying to optimize executes in about 1600ms on
my laptop and about 800ms on production-like hardware (more for the score
version). My target is to get the data fetch down below 100ms if possible.

If you have any suggestions it would be greatly appreciated. Am I missing
something obvious? Is there a logically equivalent alternative that would
be more efficient?

I'm also open to creative ways to cache the data in PostgreSQL. Should we
use a rollup table or arrays? Those options aren't ideal because of the
maintenance required, but if it's the only option I'm OK with that.
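For example, the kind of rollup I have in mind would precompute each object's position within a container so the window fetch becomes a range scan (a sketch only; the names here are made up, and the table would need periodic refreshing):

```sql
-- One possible rollup: precompute each object's rank within its
-- primary container.
CREATE TABLE container_object_rank AS
SELECT container_id,
       id AS object_id,
       row_number() OVER (PARTITION BY container_id
                          ORDER BY id DESC) AS pos
  FROM object;

CREATE UNIQUE INDEX ON container_object_rank (container_id, pos);

-- The window fetch then becomes a cheap index range scan:
SELECT object_id
  FROM container_object_rank
 WHERE container_id = $1
   AND pos BETWEEN 5001 AND 5025;
```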

Thanks,
Damon

Responses

Claudio Freire, 2014-03-02 01:02:24, Re: Help with optimizing a query over hierarchical data