Re: Partioning with overlapping and non overlapping constraints

From: Tim Uckun <timuckun(at)gmail(dot)com>
To: Melvin Davidson <melvin6925(at)gmail(dot)com>
Cc: Marc Mamin <M(dot)Mamin(at)intershop(dot)de>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Partioning with overlapping and non overlapping constraints
Date: 2015-02-09 23:05:59
Message-ID: CAGuHJrM29=Lxo4HEa_dBPCmeu_KFqkgYz_0OAt301zoQUtKA+A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

This approach wouldn't work I think. The data keeps growing in the "hot"
table.

On Tue, Feb 10, 2015 at 6:01 AM, Melvin Davidson <melvin6925(at)gmail(dot)com>
wrote:

> Well, without knowing too much about your application, it certainly sounds
> like using the metricts_YYYYMMDD is the way to go. As for modifying the
> constraint daily, couldn't you just use
>
> where timestamp > current_date - Interval '1 Day'
>
> ?
>
> On Mon, Feb 9, 2015 at 5:14 AM, Marc Mamin <M(dot)Mamin(at)intershop(dot)de> wrote:
>
>>
>> >I have two partitioning questions I am hoping somebody can help me with.
>> >
>> >I have a fairly busy metric(ish) table. It gets a few million records
>> per day, the data is transactional for a while but then settles down and is
>> used for analytical purposes later.
>> >
>> >When a metric is reported both the UTC time and the local times are
>> stored along with the other data belonging to the metric.
>>
>> Don't you have duplicate information within your UTC, location and
>> local_time data ?
>> Maybe you can just attach a timezone to each location...
>>
>> >I want to partition this table to both make it faster to query and also
>> to spread out the writes. Ideally the partitions would be based on the UTC
>> timestamp and the sending location. For example
>> >
>> >metrics_location_XXXXX_2015_01_01
>> >
>> >First problem with this approach is that there could be tens of
>> thousands of locations so this is going to result hundreds of thousands of
>> tables. I know there are no upper limits to how many tables there are but
>> I am thinking this might really get me into trouble later.
>>
>> With only a few millions rows per day, weekly or even monthly partitions
>> without regard of locations should be sufficient for older data.
>> It should be possible to partition your hot data differently; But Instead
>> of using one partition per location, you may use a hash/modulo approach to
>> keep the number of partitions in a reasonable count if required at all
>> (This can be helpful: https://github.com/markokr/pghashlib) Here I
>> would avoid to include time information except for the limit between old
>> and hot tables. And depending on the pattern and performance requirement of
>> your analytic queries this may be sufficient (i.e. don't partition on the
>> time at all).
>> With smaller partitions for hot data, it should be quite fast to move
>> them one by one to the old data. I have no experience with the trigger
>> based partitioning of Postgres (we handle partitions logically at the
>> application level), so I'm not sure how difficult this approach is. I
>> suppose that you'll need a function that move data from hot to old
>> partitons and that fix the triggers accordingly.
>>
>> >
>> >Second and possibly more vexing problem is that often the local time is
>> queried. Ideally I would like to put three constraints on the child
>> tables. Location id, UTC timestamp and the local time but obviously the
>> local timestamps would overlap with other locations in the same timezone
>> Even if I was to only partition by UTC the local timestamps would overlap
>> between tables.
>> >
>> >So the questions in a nutshell are.
>> >
>> >1. Should I be worried about having possibly hundreds of thousands of
>> shards.
>> >2. Is PG smart enough to handle overlapping constraints on table and
>> limit it's querying to only those tables that have the correct time
>> constraint.
>>
>> If you partition on the UTC time only, you don't have overlapping. When
>> querying on the local time, the planner will consider all partitions, but
>> an additional index or constraint on this column should be sufficient as
>> long as your partition count remains small.
>>
>> regards,
>> Marc Mamin
>>
>
>
>
> --
> *Melvin Davidson*
> I reserve the right to fantasize. Whether or not you
> wish to share my fantasy is entirely up to you.
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Michael Paquier 2015-02-10 00:09:27 Re: Mult-standby streaming replication master failover
Previous Message Tim Uckun 2015-02-09 23:03:44 Re: Partioning with overlapping and non overlapping constraints