| From: | Jaime Casanova <jcasanov(at)systemguards(dot)com(dot)ec> | 
|---|---|
| To: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> | 
| Cc: | pgsql-hackers(at)postgresql(dot)org | 
| Subject: | Re: maximum columns for brin bloom indexes | 
| Date: | 2021-04-08 14:49:18 | 
| Message-ID: | 20210408144918.GA3086@ahch-to | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
On Thu, Apr 08, 2021 at 12:18:36PM +0200, Tomas Vondra wrote:
> On 4/8/21 9:08 AM, Jaime Casanova wrote:
> > Hi everyone,
> > 
> > When testing brin bloom indexes I noted that we need to reduce the
> > PAGES_PER_RANGE parameter of the index to allow more columns on it.
> > 
> > Sadly, this could be a problem if you create the index before the table
> > grows, once it reaches some number of rows (i see the error as early as
> > 1000 rows) it starts error out.
> > 
> > 	create table t1(i int, j int);
> > 	
> > 	-- uses default PAGES_PER_RANGE=128
> > 	create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) ;
> > 	
> > 	insert into t1 
> > 		select random()*1000, random()*1000 from generate_series(1, 1000);
> > 	ERROR:  index row size 8968 exceeds maximum 8152 for index "t1_i_j_idx"
> > 
> > if instead you create the index with a minor PAGES_PER_RANGE it goes
> > fine, in this case it works once you reduce it to at least 116
> > 
> > 	create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) 
> > 		with (pages_per_range=116);
> > 
> > 
> > so, for having:
> > two int columns - PAGES_PER_RANGE should be max 116
> > three int columns - PAGES_PER_RANGE should be max 77
> > one int and one timestamp - PAGES_PER_RANGE should be max 121 
> > 
> > and so on
> > 
> 
> No, because this very much depends on the number if distinct values in
> the page page range, which determines how well the bloom filter
> compresses. You used 1000, but that's just an arbitrary value and the
> actual data might have any other value. And it's unlikely that all three
> columns will have the same number of distinct values.
>
Ok, that makes sense. Still I see a few odd things:
	"""
	drop table if exists t1;
	create table t1(i int, j int);
	create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) ;
	-- This one will succeed, I guess because it has less different
	-- values
	insert into t1
	select random()*20, random()*100 from generate_series(1, 1000);
	-- succeed
	insert into t1
	select random()*20, random()*100 from generate_series(1, 100000);
	-- succeed
	insert into t1
	select random()*200, random()*1000 from generate_series(1, 1000);
	-- succeed
	insert into t1
	select random()*200, random()*1000 from generate_series(1, 1000);
	-- succeed? This is the case it has been causing problems before
	insert into t1
	select random()*1000, random()*1000 from generate_series(1, 1000);
	"""
Maybe this makes sense, but it looks random to me. If it makes sense
this is something we should document better. 
Let's try another combination:
	"""
	drop table if exists t1;
	create table t1(i int, j int);
	create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) ;
	-- this fails again
	insert into t1
	select random()*1000, random()*1000 from generate_series(1, 1000);
	-- and this starts to fail now, but this worked before
	insert into t1
	select random()*20, random()*100 from generate_series(1, 1000);
	"""
> Of course, this also depends on the false positive rate.
> 
How the false positive rate work?
> FWIW I doubt people are using multi-column BRIN indexes very often.
> 
true.
Another question, should we allow to create a brin multi column index
that uses different opclasses?
CREATE INDEX ON t1 USING brin (i int4_bloom_ops, j int4_minmax_ops);
-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Julien Rouhaud | 2021-04-08 14:54:09 | Re: pg_stat_statements oddity with track = all | 
| Previous Message | David Steele | 2021-04-08 14:45:58 | Re: allow partial union-all and improve parallel subquery costing |