From: | "Amit Langote" <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> |
---|---|
To: | "'Claudio Freire'" <klaussfreire(at)gmail(dot)com> |
Cc: | "'Alvaro Herrera'" <alvherre(at)2ndquadrant(dot)com>, "'Josh Berkus'" <josh(at)agliodbs(dot)com>, "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Pg Hackers'" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: On partitioning |
Date: | 2014-12-15 09:53:50 |
Message-ID: | 007201d0184d$077bd9d0$16738d70$@lab.ntt.co.jp |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Claudio Freire wrote:
> On Sun, Dec 14, 2014 at 11:12 PM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> >> On egress you need some direct way to compare the scan quals with the
> >> partitioning values. I would imagine this to be similar to how scan
> >> quals are compared to the values stored in a BRIN index: each scan qual
> >> has a corresponding operator strategy and a scan key, and you can say
> >> "aye" or "nay" based on a small set of operations that can be run
> >> cheaply, again without any proof or running arbitrary expressions.
> >>
> >
> > My knowledge of this is far from being perfect, though to clear any
> confusions -
> >
> > As far as planning is concerned, I could not imagine how index access
> method way of pruning partitions could be made to work. Of course, I may
> be missing something.
>
> Let me be overly verbose, don't take it as patronizing, just answering
> in lots of detail why this could be a good idea to try.
>
Thanks for explaining. It helps.
> Normal indexes store a pointer for each key value of sorts. So B-Tree
> gets you a set of tids for each key, and so does GIN and hash.
>
> But BRIN is different in that it does the mapping differently. BRIN
> stores a compact, approximate representation of the set of keys within
> a page range. It can tell with some degree of (in)accuracy whether a
> key or key range could be part of that page range or not. The way it
> does this is abstracted out, but at its core, it stores a "compressed"
> representation of the key set that takes a constant amount of bits to
> store, and no more, no matter how many elements. What changes as the
> element it represents grows, is its accuracy.
>
> Currently, BRIN only supports min-max representations. It will store,
> for each page range, the minimum and maximum of some columns, and
> when
> you query it, you can compare range vs range, and discard whole page
> ranges.
>
> Now, what are partitions, if not page ranges?
Yes, I can see a partition as a page range. The fixed summary info in BRIN's terms would be range bounds in case this is a rang partition, list of values in case this is a list partition and hash value in case this is a hash partition.
There is debate on the topic but each of these partitions also happens to be a separate relation. IIUC, BRIN is an access method for a relation (say, top-level partitioned relation) that comes into play in executor if that access method survives as preferred access method by the planner. I cannot see a way to generalize it further and make it support each block range as a separate relation and then use it for partition pruning in planner. This is assuming a partitioned relation is planned as an Append node which contains a list of plans for surviving partition relations based on, say, restrict quals.
I may be thinking of BRIN as a whole as not being generalized enough but I may be wrong. Could you point out if so?
> A BRIN tuple is a min-max pair. But BRIN in more general, it could use
> other data structures to hold that "compressed representation", if
> someone implemented them. Like bloom filters [0].
>
> A BRIN index is a complex data structure because it has to account for
> physically growing tables, but all the complexities vanish when you
> fix a "block range" (the BR in BRIN) to a partition. Then, a mere
> array of BRIN tuples would suffice.
>
> BRIN already contains the machinery to turn quals into something that
> filters out entire partitions, if you provide the BRIN tuples.
>
IIUC, that machinery comes into play when, say, a Bitmap Heap scan starts, right?
> And you could even effectively matain a BRIN index for the partitions
> (just a BRIN tuple per partition, dynamically updated with every
> insertion).
>
> If you do that, you start with empty partitions, and each insert
> updates the BRIN tuple. Avoiding concurrency loss in this case would
> be tricky, but in theory this could allow very general partition
> exclusion. In fact it could even work with constraint exclusion right
> now: you'd have a single-tuple BRIN index for each partition and
> benefit from it.
>
> But you don't need to pay the price of updating BRIN indexes, as
> min-max tuples for each partition can be produced while creating the
> partitions if the syntax already provides the information. Then, it's
> just a matter of querying this meta-data which just happens to have
> the form of a BRIN tuple for each partition.
>
Thanks,
Amit
From | Date | Subject | |
---|---|---|---|
Next Message | Andres Freund | 2014-12-15 09:58:37 | Re: Escaping from blocked send() reprised. |
Previous Message | Kyotaro HORIGUCHI | 2014-12-15 09:21:16 | Re: alter user/role CURRENT_USER |