Re: advice sought - general approaches to optimizing queries around "event streams"

From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Jonathan Vanasco <postgres(at)2xlp(dot)com>, PostgreSQL general <pgsql-general(at)postgresql(dot)org>
Subject: Re: advice sought - general approaches to optimizing queries around "event streams"
Date: 2014-09-26 21:23:19
Message-ID: 5425D947.2080304@archidevsys.co.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 27/09/14 09:02, Jonathan Vanasco wrote:
> I have a growing database with millions of rows that track resources against an event stream.
>
> i have a few handfuls of queries that interact with this stream in a variety of ways, and I have managed to drop things down from 70s to 3.5s on full scans and offer .05s partial scans.
>
> no matter how i restructure queries, I can't seem to get around a few bottlenecks and I wanted to know if there were any tips/tricks from the community on how to approach them.
>
> a simple form of my database would be:
>
> -- 1k of
> create table stream (
> id int not null primary key,
> )
>
> -- 1MM of
> create table resource (
> id int not null primary key,
> col_a bool,
> col_b bool,
> col_c text,
> );
>
> -- 10MM of
> create table streamevent (
> id int not null,
> event_timestamp timestamp not null,
> stream_id int not null references stream(id)
> );
>
> -- 10MM of
> create table resource_2_stream_event(
> resource_id int not null references resource(id),
> streamevent_id int not null references streamevent(id)
> )
>
> Everything is running off of indexes; there are no seq scans.
>
> I've managed to optimize my queries by avoiding joins against tables, and turning the stream interaction into a subquery or CTE.
> better performance has come from limiting the number of "stream events" ( which are only the timestamp and resource_id off a joined table )
>
> The bottlenecks I've encountered have primarily been:
>
> 1. When interacting with a stream, the ordering of event_timestamp and deduplicating of resources becomes an issue.
> I've figured out a novel way to work with the most recent events, but distant events are troublesome
>
> using no limit, the query takes 3500 ms
> using a limit of 10000, the query takes 320ms
> using a limit of 1000, the query takes 20ms
>
> there is a dedicated index of on event_timestamp (desc) , and it is being used
> according to the planner... finding all the records is fine; merging-into and sorting the aggregate to handle the deduplication of records in a stream seems to be the issue (either with DISTINCT or max+group_by)
>
>
> 2. I can't figure out an effective way to search for a term against an entire stream (using a tsquery/gin based search)
>
> I thought about limiting the query by finding matching resources first, then locking it to an event stream, but:
> - scanning the entire table for a term takes about 10 seconds on an initial hit. subsequent queries for the same terms end up using the cache, and complete within 20ms.
>
> I get better search performance by calculating the event stream, then searching it for matching documents, but I still have the performance issues related to limiting the window of events
>
> i didn't include example queries, because I'm more concerned with the general approaches and ideas behind dealing with large data sets than i am with raw SQL right now.
>
> i'm hoping someone can enlighten me into looking at new ways to solve these problems. i think i've learned more about postgres/sql in the past 48hour than I have in the past 15 years, and I'm pretty sure that the improvements I need will come from new ways of querying data , rather than optimizing the current queries.
>
Minor point: when specifying PRIMARY KEY, you don't need to also put NOT
NULL (this should make no change to performance).

I notice that the 'id' of 'streamevent' is not marked as a PRIMARY KEY,
so it will not have an index associated with it - hence referencing it
as a foreign key might be slower than expected.

Cheers,
Gavin

In response to

Browse pgsql-general by date

  From Date Subject
Next Message DrakoRod 2014-09-26 22:02:52 Re: password in recovery.conf
Previous Message Brent Wood 2014-09-26 21:22:34 Re: advice sought - general approaches to optimizing queries around "event streams"