Re: PATCH: Using BRIN indexes for sorted output

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Greg Stark <stark(at)mit(dot)edu>, Zhihong Yu <zyu(at)yugabyte(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: PATCH: Using BRIN indexes for sorted output
Date: 2023-02-24 19:14:12
Message-ID: 9799009d-fa32-2245-e6a2-155920436708@enterprisedb.com
Lists: pgsql-hackers

On 2/24/23 19:03, Matthias van de Meent wrote:
> On Thu, 23 Feb 2023 at 19:48, Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> On 2/23/23 17:44, Matthias van de Meent wrote:
>>> On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
>>> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>>>
>>>> On 2/23/23 15:19, Matthias van de Meent wrote:
>>>>> Comments on 0001, mostly comments and patch design:
>>>
>>> One more comment:
>>>
>>>>>> + range->min_value = bval->bv_values[0];
>>>>>> + range->max_value = bval->bv_values[1];
>>>
>>> This produces dangling pointers for minmax indexes on by-ref types
>>> such as text and bytea, due to the memory context of the decoding
>>> tuple being reset every iteration. :-(
>>>
>>
>> Yeah, that sounds like a bug. Also a sign the tests should have some
>> by-ref data types (presumably there are none, as that would certainly
>> trip some asserts etc.).
>
> I'm not sure we currently trip asserts, as the data we store in the
> memory context is very limited, making it unlikely we actually release
> the memory region back to the OS.
> I did get assertion failures by adding the attached patch, but I don't
> think that's something we can do in the final release.
>

But we should randomize the memory if we ever do pfree(), and it's
strange that valgrind didn't complain when I ran the tests with it.
Anyway, I'll take a look and see if we can add some tests covering this.
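
FWIW the fix I have in mind is roughly this (just a sketch, untested -
"rangescan_cxt" is a made-up name for whatever long-lived context the
code ends up using, and attbyval/attlen would come from the indexed
attribute; datumCopy() is in utils/datum.h):

    MemoryContext oldcxt;

    /*
     * Copy the values into a context that survives the per-tuple reset,
     * so that by-ref types (text, bytea, ...) don't end up as dangling
     * pointers. For by-value types datumCopy() just returns the value.
     */
    oldcxt = MemoryContextSwitchTo(rangescan_cxt);
    range->min_value = datumCopy(bval->bv_values[0], attbyval, attlen);
    range->max_value = datumCopy(bval->bv_values[1], attbyval, attlen);
    MemoryContextSwitchTo(oldcxt);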

>>> With the proposed patch, we do O(ncols_statsenabled) scans over the
>>> BRIN index. Each scan reads all ncol columns of all block ranges from
>>> disk, so in effect the data scan does on the order of
>>> O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
>>> all columns have statistics enabled.
>>>
>>
>> I don't think that's the number of I/O operations we'll do, because we
>> always read the whole BRIN tuple at once. So I believe it should rather
>> be something like
>>
>> O(ncols_statsenabled * nranges)
>>
>> assuming nranges is the number of page ranges. But even that's likely a
>> significant overestimate because most of the tuples will be served from
>> shared buffers.
>
> We store some data per index attribute, which makes the IO required
> for a single indexed range proportional to the number of index
> attributes.
> If we then scan the index a number of times proportional to the number
> of attributes, the cumulative IO load of statistics gathering for that
> index is quadratic on the number of attributes, not linear, right?
>

Ah, OK. I was focusing on the number of I/O operations while you're
talking about the amount of I/O performed. You're right the amount of
I/O is quadratic, but I think what's more interesting is the comparison
between the alternative ANALYZE approaches. The current simple approach
does a multiple of the I/O a single combined scan would do, with the
multiple being linear in the number of attributes.

Which is not great, ofc.
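
Just to put a rough number on it: with, say, 10 attributes (all with
stats enabled) and 10000 page ranges, the per-column approach reads all
~10000 BRIN tuples 10 times, i.e. about 10x the index size in I/O,
whereas a combined pass would read each tuple only once. Each tuple
carries all 10 attributes either way, which is where the extra ncols
factor in your formula comes from.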

>> Considering how tiny BRIN indexes are, this is likely orders of
>> magnitude less I/O than we expend on sampling rows from the table. I
>> mean, with the default statistics target we read ~30000 pages (~240MB)
>> or more just to sample the rows. Randomly, while the BRIN index is
>> likely scanned mostly sequentially.
>
> Mostly agreed; except I think it's not too difficult to imagine a BRIN
> index on that scale - for example with bloom filters, which easily
> take up several hundred bytes.
>
> With the default configuration of 128 pages_per_range,
> n_distinct_per_range of -0.1, and false_positive_rate of 0.01, the
> bloom filter size is 4.36 KiB - each indexed item on its own page. It
> is still only 1% of the original table's size, but there are enough
> tables that are larger than 24GB that this could be a significant
> issue.
>

Right, it's certainly true BRIN indexes may be made fairly large (either
by using something like bloom or by making the ranges much smaller). But
those are exactly the indexes where building statistics for all columns
at once is going to cause issues with memory usage etc.

Note: Obviously, that depends on how much data per range we need to keep
in memory. For bloom I doubt we'd actually want to keep all the filters;
we'd probably calculate just some statistics (e.g. number of bits set),
so maybe the memory consumption is not that bad.

In fact, for such indexes the memory consumption may already be an issue
even when analyzing the index column by column. My feeling is we'll need
to do something about that, e.g. by reading only a sample of the ranges,
or something like that. That might help with the I/O too, I guess.
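
As a back-of-the-envelope example (using your bloom numbers): a 24GB
table is ~3M pages, i.e. ~24.5k ranges at 128 pages_per_range, and at
~4.4kB per filter that's on the order of 100MB of bloom summaries per
attribute - which is why I'd rather keep a few aggregated numbers per
range (or sample the ranges) than the raw filters.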

> Note that most of my concerns are related to our current
> documentation's statement that there are no demerits to multi-column
> BRIN indexes:
>
> """
> 11.3. Multicolumn Indexes
>
> [...] The only reason to have multiple BRIN indexes instead of one
> multicolumn BRIN index on a single table is to have a different
> pages_per_range storage parameter.
> """
>

True, we may need to clarify this in the documentation.

> Wide BRIN indexes add IO overhead for single-attribute scans when
> compared to single-attribute indexes, so having N single-attribute
> index scans to build the statistics on an N-attribute index is not
> great.
>

Perhaps, but it's not like the alternative is perfect either
(complexity, memory consumption, ...). IMHO it's a reasonable tradeoff.

>> Maybe there are cases where this would be an issue, but I haven't seen
>> one when working on this patch (and I did a lot of experiments). I'd
>> like to see one before we start optimizing it ...
>
> I'm not only worried about optimizing it, I'm also worried that we're
> putting this abstraction at the wrong level in a way that is difficult
> to modify.
>

Yeah, that's certainly a valid concern. I admit I only did the minimum
amount of work on this part, as I was focused on the sorting part.

>> This also reminds me that the issues I actually saw (e.g. memory
>> consumption) would be made worse by processing all columns at once,
>> because then you need to keep more columns in memory.
>
> Yes, I think that can be a valid concern, but don't we already do
> similar things in the current table statistics gathering?
>

Not really, I think. We sample a bunch of rows once, but then we build
statistics on this sample for each attribute / expression independently.
We could of course read the whole index into memory and then run the
analysis, but I think the sampled tuples tend to be much smaller (thanks
to TOAST etc.) and we only really scan a limited number of them (30k).

But if we're concerned about the I/O, the BRIN is likely fairly large,
so maybe reading it into memory at once is not a great idea.

It'd be more similar if we sampled the BRIN ranges - I think we'll have
to do something like that anyway.
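
Something like the sketch below is what I mean by sampling the ranges
(very rough, not tied to the actual BRIN revmap API; sample_frac,
nblocks and pages_per_range are assumed to come from the analyze
context):

    BlockNumber heapBlk;

    for (heapBlk = 0; heapBlk < nblocks; heapBlk += pages_per_range)
    {
        /* Bernoulli sample: skip ranges not selected */
        if (pg_prng_double(&pg_global_prng_state) > sample_frac)
            continue;

        /* ... fetch the BRIN tuple for this range, update the stats ... */
    }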

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
