Re: Questions about btree_gin vs btree_gist for low cardinality columns

From: Jeremy Finzel <finzelj(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Questions about btree_gin vs btree_gist for low cardinality columns
Date: 2019-06-04 13:37:22
Message-ID: CAMa1XUg_1xLyKshjB-oNa1L41QukrTj3hf226hYHaP3kgpj2qA@mail.gmail.com
Lists: pgsql-general

On Sun, Jun 2, 2019 at 6:07 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Anyway, the larger point here is that right now btree_gin is just a quick
> hack, and it seems like it might be worth putting some more effort into
> it, because the addition of duplicate-compression changes the calculus
> for whether it's useful.
>

Thank you to all for the thoughtful and thorough replies!

To clarify, by "low cardinality" I definitely do NOT mean a boolean or a
handful of values, but rather enough distinct values that:

1. The column is selective enough that having *some kind of index* will
significantly speed up queries vs. a sequential scan
2. There are too many distinct values to cover with partial indexes without
it becoming a maintenance nightmare
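As a sketch of the alternative (table and column names here are
hypothetical, not from my actual schema), a single btree_gin index can
stand in for a whole family of partial indexes on such a column:

```sql
-- btree_gin lets GIN index plain scalar types like int or text.
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- One GIN index covers every value of the column, instead of maintaining
-- a separate partial btree index per value of interest, e.g.
--   CREATE INDEX ... ON orders (id) WHERE account_code = 'X';
CREATE INDEX orders_account_code_gin
    ON orders USING gin (account_code);
```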

In our environment, we happen to have this kind of situation a lot. For
example, account codes, or other foreign keys to lookup tables that
categorize frequently queried data, exist on tables with upwards of 100
million rows. Typically such a column has somewhere between 50 and 500
unique values.

Historically, we have just had a bunch of regular btree indexes on these,
and there are quite a lot of them indeed.

Here is my specific example:

- The table has 127 million rows, including a toasted field, and is 270GB
in total
- The filter is on a column with only 16 unique values
- The actual filter condition restricts a join to 4 of the 16 unique
values

As I said, performance was nearly identical for btree vs. gin. But I was
much happier with the I/O of GIN given its tiny size (buffer counts as
reported by EXPLAIN (ANALYZE, BUFFERS)):

- Buffers for join using btree: shared hit=12 read=328991
- Buffers for join using gin: shared hit=12 read=13961

The btree index here is 3.8GB, whereas the gin index is only 200MB. But I
have not tested how either index handles bloat.
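For reference, this is roughly how the comparison was made (the table and
column names below are placeholders, not my real schema):

```sql
-- Hypothetical query shape: a join restricted to 4 of the 16 values.
EXPLAIN (ANALYZE, BUFFERS)
SELECT t.*
FROM big_table t
JOIN categories c ON c.id = t.category_id
WHERE c.name IN ('a', 'b', 'c', 'd');
-- The "Buffers: shared hit=N read=M" lines in the output show how many
-- 8kB pages each plan touched; far fewer reads with the GIN index means
-- a much smaller portion of the index had to come off disk.
```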

Morris, I think the reason your gin index was clearly slower is precisely
that your example benefits from an index-only scan, which GIN does not
support.
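A quick way to see the difference (hypothetical names again): run the same
query with each index in place and compare the plans.

```sql
-- With a btree index on category_id, a query that needs only that column
-- can be answered entirely from the index:
EXPLAIN SELECT category_id FROM big_table WHERE category_id = 3;
-- btree available:  Index Only Scan using big_table_category_btree ...
-- only GIN available: Bitmap Index Scan feeding a Bitmap Heap Scan,
-- because GIN cannot return indexed values, so every match still
-- requires a heap visit.
```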

Thanks,
Jeremy
