Re: Large data and slow queries

From: Andrew Staller <andrew(at)timescale(dot)com>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Large data and slow queries
Date: 2017-04-21 20:41:25
Message-ID: CAEsM1FuNwJNUBUBk5pwrQqP_w5D7MWJ+YaR-mRG-r=0Q-Z-y_Q@mail.gmail.com
Lists: pgsql-general

Samuel,

Short answer to your questions: (1) TimescaleDB and CitusDB are focusing
on solving different problems, and (2) TimescaleDB is an Apache 2-licensed
extension to run in your Postgres database, not a fork or different system.

Longer answer to your first question:

From what we've read and the conversations we've had with Citus (awesome
folks, btw), they're primarily solving a different problem -- OLAP queries
and distributed transactions -- while we are focused on time-series data
analysis. As such, we haven't benchmarked against Citus, and if we were to,
it would require some unrealistic finagling to make it an
apples-to-apples comparison.

As an example, their partitioning is based on a single primary key, and
all writes in their clustered version go to a cluster master (although
their hosted Citus MX version changes this a bit). We perform
two-dimensional partitioning using your primary key and time, with
partitions automatically created and closed based on time intervals and
data volume, in order to keep the most recent partitioned chunks of data
(and their B-trees) in memory and support fast ingest for time-series
data. (***) We have only released our single-node version, but the
clustering will allow you to write directly to any node. Citus has also
decided to support only the SQL queries they are optimized for, while
we've made the decision to support "full SQL" (even if some queries might
not be fast).
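To make that concrete, here's a rough sketch of what this looks like from
SQL. The table and column names are hypothetical, and the exact
create_hypertable signature may differ between versions -- treat this as
illustrative, not as the exact API:

```sql
-- Hypothetical schema: one row per device reading.
CREATE TABLE conditions (
  time        TIMESTAMPTZ      NOT NULL,
  device_id   INTEGER          NOT NULL,
  temperature DOUBLE PRECISION
);

-- Turn the plain table into a hypertable partitioned on time, and
-- additionally on device_id across 4 "space" partitions.
-- (Approximate call -- check the TimescaleDB docs for the current form.)
SELECT create_hypertable('conditions', 'time', 'device_id', 4);

-- Reads and writes are then plain SQL against the one table; chunk
-- creation and placement happen automatically underneath.
INSERT INTO conditions VALUES (now(), 1, 72.5);
```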

*** We blogged yesterday about TimescaleDB's partitioning design choices in
more depth, if you are interested:
https://blog.timescale.com/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c
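And for a sense of the kind of query this design targets -- selective on
both time and the space key -- here's a hypothetical example, assuming a
conditions(time, device_id, temperature) table that has been made into a
hypertable:

```sql
-- Only chunks overlapping the last 24 hours (and the matching space
-- partitions) need to be scanned; older chunks are skipped entirely.
SELECT device_id, avg(temperature)
FROM conditions
WHERE time > now() - interval '24 hours'
  AND device_id = 7
GROUP BY device_id;
```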

On Thu, Apr 20, 2017 at 5:43 PM, Samuel Williams <
space(dot)ship(dot)traveller(at)gmail(dot)com> wrote:

> Andrew, how would timescaledb compare to citus - and is timescaledb an
> extension to postgres or is it an entirely separate system?
>
> On 21 April 2017 at 02:44, Andrew Staller <andrew(at)timescale(dot)com> wrote:
> > Awesome thread.
> >
> > Samuel,
> >
> > Just wanted you to be aware of the work we're doing at TimescaleDB
> > (http://www.timescale.com/) a time-series database extension for
> > PostgreSQL.
> >
> > Some of how we might help you:
> > - automatic partitioning by space (primary key - like country_id, for
> > instance) and time. This creates "chunks" of your data, right-sized by
> > volume and time constraints (which makes inserts fast at much greater
> scale
> > than Vanilla PostgreSQL - not sure if this is a problem for you)
> > - this will also help if your queries are selective on time and
> country_id
> > (per this example)
> > - the partitioning by time allows you to DROP old chunks without the need
> > for vacuums
> >
> > On Thu, Apr 20, 2017 at 8:30 AM, Vick Khera <vivek(at)khera(dot)org> wrote:
> >>
> >> I'm curious why you have so many partial indexes. Are you trying to make
> >> custom indexes per query? It seems to me you might want to consider
> making
> >> the indexes general, and remove the redundant ones (that have the same
> >> prefix list of indexed fields).
> >>
> >> Secondly your table is 102Gb. Clearly there's a lot of data here. How
> many
> >> rows does that take? I would further suggest that you partition this
> table
> >> such that there are no more than about 10 million rows per partition
> (I've
> >> done this by using a id % 100 computation). Maybe in your case it makes
> >> sense to partition it based on the "what" field, because it appears you
> are
> >> trying to do that with your partial indexes already.
> >>
> >> On Wed, Apr 19, 2017 at 10:11 PM, Samuel Williams
> >> <space(dot)ship(dot)traveller(at)gmail(dot)com> wrote:
> >>>
> >>> Okay, so after changing longitude/latitude to float4, and
> >>> re-organizing the table a bit, I got the query down from about 8
> >>> minutes to 40 seconds.
> >>>
> >>> The details are in the gist comments:
> >>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121
> >>>
> >>> Now, just need to get performance another 2 orders of magnitude
> >>> better. Can we make the index a bit more coarse grained, perhaps
> >>> convert long/lat to integers or something, use a hilbert index, use a
> >>> postgis index... ideas?
> >>>
> >>>
> >>> --
> >>> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> >>> To make changes to your subscription:
> >>> http://www.postgresql.org/mailpref/pgsql-general
> >>
> >>
> >
> >
> >
> > --
> > TimescaleDB | Growth & Developer Evangelism
> > c: 908.581.9509
> >
> > 335 Madison Ave.
> > New York, NY 10017
> > www.timescale.com
> > https://github.com/timescale/timescaledb
>
>
>

--
TimescaleDB | Growth & Developer Evangelism
c: 908.581.9509

335 Madison Ave.
New York, NY 10017
www.timescale.com
https://github.com/timescale/timescaledb
