Re: WIP: BRIN multi-range indexes

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: WIP: BRIN multi-range indexes
Date: 2021-02-04 00:49:34
Message-ID: CALNJ-vTKV9HZ_yiMVZc7mBrc_95-SJmL+4AEjKSeMNgB-8Q+oQ@mail.gmail.com
Lists: pgsql-hackers

Hi,
For 0007-Remove-the-special-batch-mode-use-a-larger--20210203.patch:

+        /* same as preceding value, so store it */
+        if (compare_values(&range->values[start + i - 1],
+                           &range->values[start + i],
+                           (void *) &cxt) == 0)
+            continue;
+
+        range->values[start + n] = range->values[start + i];

It seems the comment doesn't match the code: the value is stored only when
the subsequent value differs from the preceding one; when the two values
compare equal, the store is skipped.
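A minimal standalone sketch of the same deduplication loop, with the
comments matching what the code actually does (plain ints stand in for
range->values and compare_values):

```c
#include <assert.h>

/* Deduplicate a sorted array in place, keeping the first of each run
 * of equal values. Returns the number of distinct values kept. */
static int
dedup(int *values, int n)
{
    int k = 1;  /* values[0] is always kept */

    for (int i = 1; i < n; i++)
    {
        /* same as preceding value, so skip it */
        if (values[i] == values[i - 1])
            continue;

        /* differs from preceding value, so store it */
        values[k++] = values[i];
    }
    return k;
}
```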

For has_matching_range():
+ int midpoint = (start + end) / 2;

I think the standard idiom for computing a midpoint is start + (end - start) / 2,
which avoids signed integer overflow when start + end exceeds INT_MAX.
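To illustrate the point (overflow is unlikely for BRIN range counts in
practice, but the idiom is safe for any valid index pair):

```c
#include <assert.h>
#include <limits.h>

/* (start + end) / 2 overflows when start + end > INT_MAX;
 * start + (end - start) / 2 stays in range whenever start <= end. */
static int
midpoint(int start, int end)
{
    return start + (end - start) / 2;
}
```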

+ /* this means we ran out of ranges in the last step */
+ if (start > end)
+ return false;

It seems the above check should come before the computation of midpoint, so
that midpoint is never derived from an already-empty interval.
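The suggested loop shape, as a simplified stand-in for has_matching_range()
(searching plain sorted ints rather than ranges): the termination check
runs before the midpoint is computed.

```c
#include <assert.h>
#include <stdbool.h>

/* Binary search over a sorted array; returns true if key is present. */
static bool
has_matching_value(const int *values, int nvalues, int key)
{
    int start = 0;
    int end = nvalues - 1;

    while (true)
    {
        int mid;

        /* this means we ran out of values in the last step */
        if (start > end)
            return false;

        mid = start + (end - start) / 2;

        if (values[mid] == key)
            return true;
        else if (values[mid] < key)
            start = mid + 1;
        else
            end = mid - 1;
    }
}
```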

A similar comment applies to the corresponding code in AssertCheckRanges().

Cheers

On Wed, Feb 3, 2021 at 3:55 PM Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
wrote:

> Hi,
>
> Here's an updated and significantly improved version of the patch
> series, particularly the multi-minmax part. I've fixed a number of
> stupid bugs in that, discovered by either valgrind or stress-tests.
>
> I was surprised by some of the bugs, or rather that the existing
> regression tests failed to crash on them, so it's probably worth briefly
> discussing the details. There were two main classes of such bugs:
>
>
> 1) missing datumCopy
>
> AFAICS this happened because there were a couple missing datumCopy
> calls, and BRIN covers multiple pages, so with by-ref data types we
> added a pointer but the buffer might have gone away unexpectedly.
> Regular regression tests passed just fine, because brin_multi runs
> almost separately, so the chance of the buffer being evicted was low.
> Valgrind reported this (with a rather enigmatic message, as usual), and
> so did a simple stress-test creating many indexes concurrently. Anything
> causing aggressive eviction of buffers would do the trick, I think,
> triggering segfaults, asserts, etc.
>
>
> 2) bogus opclass definitions
>
> There were a couple opclasses referencing incorrect distance function,
> intended for a different data type. I was scratching my head WTH the
> regression tests pass, as there is a table to build multi-minmax index
> on all supported data types. The reason is pretty silly - the table is
> very small, just 100 rows, with very low fillfactor (so only a couple of
> values per page), and the index was created with pages_per_range=1. So
> the compaction was not triggered and the distance function was never
> actually called. I've decided to build the indexes on a larger data set
> first, to test this. But maybe this needs somewhat different approach.
>
>
> BLOOM
> -----
>
> The attached patch series addresses comments from the last review. As
> for the size limit, I've defined a new macro
>
> #define BloomMaxFilterSize \
>     MAXALIGN_DOWN(BLCKSZ - \
>                   (MAXALIGN(SizeOfPageHeaderData + \
>                             sizeof(ItemIdData)) + \
>                    MAXALIGN(sizeof(BrinSpecialSpace)) + \
>                    SizeOfBrinTuple))
>
> and use that to determine if the bloom filter is too large. IMO that's
> close enough, considering that this is a best-effort check anyway (due
> to not being able to consider multi-column indexes).
>
>
> MINMAX-MULTI
> ------------
>
> As mentioned, there's a lot of fixes and improvements in this part, but
> the basic principle is still the same. I've kept it split into three
> parts with different approaches to building, so that it's possible to do
> benchmarks and comparisons, and pick the best one.
>
> a) 0005 - Aggressively compacts the summary, by always keeping it within
> the limit defined by values_per_range. So if the range contains more
> values, this may trigger compaction very often in some cases (e.g. for
> monotonic series).
>
> One drawback is that the more often the compactions happen, the less
> optimal the result is - the algorithm is kinda greedy, picking something
> like local optimums in each step.
>
> b) 0006 - Batch build, exactly the opposite of 0005. Accumulates all
> values in a buffer, then does a single round of compaction at the very
> end. This obviously doesn't have the "greediness" issues, but it may
> consume quite a bit of memory for some data types and/or indexes with
> large BRIN ranges.
>
> c) 0007 - A hybrid approach, using a buffer that is multiple of the
> user-specified value, with some safety min/max limits. IMO this is what
> we should use, although perhaps with some tuning of the exact limits.
>
>
> Attached is a spreadsheet with benchmark results for each of those three
> approaches, on different data types (byval/byref), data set types, index
> parameters (pages/values per range) etc. I think 0007 is a reasonable
> compromise overall, with performance somewhere in between 0005 and 0006.
> Of course, there are cases where it's somewhat slow, e.g. for data types
> with expensive comparisons and data sets forcing frequent compactions,
> in which case it's ~10x slower compared to regular minmax (in most cases
> it's ~1.5x). Compared to btree, it's usually much faster - ~2-3x as fast
> (except for some extreme cases, of course).
>
>
> As for the opclasses for indexes without "natural" distance function,
> implemented in 0008, I think we should drop it. In theory it works, but
> I'm far from convinced it's actually useful in practice. Essentially, if
> you have a data type with ordering but without a meaningful concept of a
> distance, it's hard to say what is an outlier. I'd bet most of those
> data types are used as "labels" where even the ordering is kinda
> useless, i.e. hardly anyone uses range queries on things like names,
> it's all just equality searches. Which means the bloom indexes are a
> much better match for this use case.
>
>
> The other thing we were considering was using the new multi-minmax
> opclasses as default ones, replacing the existing minmax ones. IMHO we
> shouldn't do that either. For existing minmax indexes that's useless
> (the opclass seems to be working, otherwise the index would be dropped).
> But even for new indexes I'm not sure it's the right thing, so I don't
> plan to change this.
>
>
> I'm also attaching the stress-test that I used to test the hell out of
> this, creating indexes on various data sets, data types, with varying
> index parameters, etc.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
