Re: Partial index-based load balancing

From: Andy Colson <andy(at)squeakycode(dot)net>
To: fabio(at)vuole(dot)me, pgsql-general(at)postgresql(dot)org
Cc: Brendan Cervin <brendan(dot)c(at)geoop(dot)com>
Subject: Re: Partial index-based load balancing
Date: 2015-03-31 22:07:10
Message-ID: 551B1A8E.7030200@squeakycode.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 3/31/2015 3:28 PM, Fabio Ugo Venchiarutti wrote:
>
>
> On 01/04/15 06:12, Andy Colson wrote:
>> On 3/31/2015 1:58 AM, Fabio Ugo Venchiarutti wrote:
>>> Greetings
>>>
>>>
>>> I'm working for a startup and our core DB is growing rather fast.
>>>
>>> Our target scale is large enough that we expect some of our core tables'
>>> indexes to grow bigger than the memory on any single node over the next
>>> couple years (our current intended design involves conventional
>>> stream-replication-based, write-on-one-read-from-many load balancing).
>>>
>>> We don't fancy the idea of using inheritance through partitioning due to
>>> the maintenance overhead and our reliance on validation
>>> constraints/triggers.
>>>
>>> My proposal will be to instead create a number of partial indexes
>>> covering predefined ranges of client IDs, then use a connection-level
>>> routing mechanism that relies on what range the relevant client's data
>>> belongs to in order to address the right node and match the right
>>> partial index.
>>>
>>> The idea is to have any given read-only node hold just one of the
>>> partial indexes in its cache and never fetch index pages off its
>>> secondary storage.
>>> Scaling would just be a matter of increasing the partitioning density.
>>>
>>>
>>> I'm going to assume that I'm not the first one to come up with this
>>> strategy (and that there already is a name for it. If so, what is it?).
>>>
>>>
>>> Is it a valid setup or am I missing some key aspect of how index
>>> partitioning is meant to work?
>>>
>>>
>>> TIA
>>>
>>>
>>> Best regards
>>>
>>>
>>> Fabio Ugo Venchiarutti
>>>
>>>
>>
>> Have you timed it? It sounds like it should work ok, but just to play
>> devils advocate it sounds a little over complex.
>>
>> Heres my thinking: Have one index, have lots of read-only nodes, do the
>> connection-level routing. I think the PG caching and OS caching will be
>> smart enough to cache the parts of the index that's hot per node. You
>> are also less likely to confuse the planner.
>>
>> For example, I have one db, with web hits logging, with a timestamp
>> index. Its upward of 25 gig, but I only every query the last 24 hours
>> to get ip/session/hit counts. That part of the index/db is cached very
>> well and responds very quick. If I ever hit data that's many days old
>> it slow's down hitting the HD.
>>
>> I don't know for sure which method would be faster though. The only
>> real difference between the two methods is the number of indexes. I'd
>> bet there is no speed difference between them. PG will figure out what
>> data is hot and cache it.
>>
>> -Andy
>>
>>
>
>
> I always assumed that an index, or at least the root level of a BTREE
> one, had to be completely cached in order to guarantee instantaneous
> lookups, but your thinking is seems sound as after the first level
> resolution only certain branches will be walked on any given slave.
>
> Will try both approaches and update you with about my findings if you're
> interested (won't happen before a few weeks at least due to management
> bottlenecks).
>
> Many thanks
>
>
> F
>
>
>

Yes, interested. Everyone loves a good benchmark. And if your solution
(with multiple partial index) (which I've never seen discussed on this
list before) is faster, then I propose we name it after you. :-)

-Andy

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Joshua Ma 2015-04-01 00:43:39 Why does CREATE INDEX CONCURRENTLY need two scans?
Previous Message Fabio Ugo Venchiarutti 2015-03-31 20:28:14 Re: Partial index-based load balancing