Quick Links

Re: maximum columns for brin bloom indexes

From:	Jaime Casanova <jcasanov(at)systemguards(dot)com(dot)ec>
To:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: maximum columns for brin bloom indexes
Date:	2021-04-08 14:49:18
Message-ID:	20210408144918.GA3086@ahch-to
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, Apr 08, 2021 at 12:18:36PM +0200, Tomas Vondra wrote:
> On 4/8/21 9:08 AM, Jaime Casanova wrote:
> > Hi everyone,
> >
> > When testing brin bloom indexes I noted that we need to reduce the
> > PAGES_PER_RANGE parameter of the index to allow more columns on it.
> >
> > Sadly, this could be a problem if you create the index before the table
> > grows, once it reaches some number of rows (i see the error as early as
> > 1000 rows) it starts error out.
> >
> > create table t1(i int, j int);
> >
> > -- uses default PAGES_PER_RANGE=128
> > create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) ;
> >
> > insert into t1
> > select random()*1000, random()*1000 from generate_series(1, 1000);
> > ERROR: index row size 8968 exceeds maximum 8152 for index "t1_i_j_idx"
> >
> > if instead you create the index with a minor PAGES_PER_RANGE it goes
> > fine, in this case it works once you reduce it to at least 116
> >
> > create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops )
> > with (pages_per_range=116);
> >
> >
> > so, for having:
> > two int columns - PAGES_PER_RANGE should be max 116
> > three int columns - PAGES_PER_RANGE should be max 77
> > one int and one timestamp - PAGES_PER_RANGE should be max 121
> >
> > and so on
> >
>
> No, because this very much depends on the number if distinct values in
> the page page range, which determines how well the bloom filter
> compresses. You used 1000, but that's just an arbitrary value and the
> actual data might have any other value. And it's unlikely that all three
> columns will have the same number of distinct values.
>

Ok, that makes sense. Still I see a few odd things:

"""
drop table if exists t1;
create table t1(i int, j int);
create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) ;

-- This one will succeed, I guess because it has less different
-- values
insert into t1
select random()*20, random()*100 from generate_series(1, 1000);

-- succeed
insert into t1
select random()*20, random()*100 from generate_series(1, 100000);

-- succeed
insert into t1
select random()*200, random()*1000 from generate_series(1, 1000);

-- succeed? This is the case it has been causing problems before
insert into t1
select random()*1000, random()*1000 from generate_series(1, 1000);
"""

Maybe this makes sense, but it looks random to me. If it makes sense
this is something we should document better.

Let's try another combination:

"""
drop table if exists t1;
create table t1(i int, j int);
create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) ;

-- this fails again
insert into t1
select random()*1000, random()*1000 from generate_series(1, 1000);

-- and this starts to fail now, but this worked before
insert into t1
select random()*20, random()*100 from generate_series(1, 1000);
"""

> Of course, this also depends on the false positive rate.
>

How the false positive rate work?

> FWIW I doubt people are using multi-column BRIN indexes very often.
>

true.

Another question, should we allow to create a brin multi column index
that uses different opclasses?

CREATE INDEX ON t1 USING brin (i int4_bloom_ops, j int4_minmax_ops);

--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL

In response to

Re: maximum columns for brin bloom indexes at 2021-04-08 10:18:36 from Tomas Vondra

Responses

Re: maximum columns for brin bloom indexes at 2021-04-08 14:57:59 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Julien Rouhaud	2021-04-08 14:54:09	Re: pg_stat_statements oddity with track = all
Previous Message	David Steele	2021-04-08 14:45:58	Re: allow partial union-all and improve parallel subquery costing