Re: Questions about btree_gin vs btree_gist for low cardinality columns

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Jeremy Finzel <finzelj(at)gmail(dot)com>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Questions about btree_gin vs btree_gist for low cardinality columns
Date: 2019-06-02 16:25:07
Message-ID: CAMkU=1xpw_U6ZvymatbOXi5A=e8=NmCzyLmAiBKNO3B6qKqijQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Fri, May 24, 2019 at 11:26 AM Jeremy Finzel <finzelj(at)gmail(dot)com> wrote:

> I have been hoping for clearer direction from the community about
> specifically btree_gin indexes for low cardinality columns (as well as low
> cardinality multi-column indexes). In general there is very little
> discussion about this both online and in the docs. Rather, the emphasis
> for GIN indexes discussed is always on full text search of JSON indexing,
> not btree_gin indexes.
>
> However, I have never been happy with the options open to me for indexing
> low cardinality columns and was hoping this could be a big win. Often I
> use partial indexes as a solution, but I really want to know how many use
> cases btree_gin could solve better than either a normal btree or a partial
> index.
>

What does "low cardinality" mean here? For example, I have a million
different items with an item_id, but on average about 30 copies of each
item. The inventory table has a row for each copy (copies are not fully
interchangeable, so can't be just be a count in some other table). A
million is not usually considered a low number, but I find a gin index
(over btree_gin on the item_id) useful here as it produces a much smaller
index due to not repeating the item_id each time and due to compressing the
tid list, even though there are only about 30 tids to compress.

(You mention basically the reciprocal of this, 30 distinct values with
3,000,000 copies each, but I don't if that was your actual use case, or
just the case you were reading about in the documents you were reading)

>
> Here are my main questions:
>
> 1.
>
> "The docs say regarding *index only scans*: The index type must support
> index-only scans. B-tree indexes always do. GiST and SP-GiST indexes
> support index-only scans for some operator classes but not others. Other
> index types have no support. The underlying requirement is that the index
> must physically store, or else be able to reconstruct, the original data
> value for each index entry. As a counterexample, GIN indexes cannot support
> index-only scans because each index entry typically holds only part of the
> original data value."
>
> This is confusing to say "B-tree indexes always do" and "GIN indexes
> cannot support index-only scans", when we have a btree_gin index type.
> Explanation please ???
>

B-tree is the name of a specific index implementation in PostgreSQL. That
is what is referred to here. btree_gin offers operators to be used in a
GIN index to mimic/implement b-tree data structures/algorithms, but that
doesn't make it the same thing. It is just a GIN index doing btree things.
Perhaps PostgreSQL should use a "brand name" for their specific
implementation of the default index type, to distinguish it from the
generic algorithm description.

>
> Is it true that for a btree_gin index on a regular column, "each index
> entry typically holds only part of the original data value"? Do these
> still not support index only scans? Could they? I can't see why they
> shouldn't be able to for a single indexed non-expression field?
>

For single column using a btree_gin operator, each index entry holds the
entire data value. But the system doesn't know about that in any useful
way, so it doesn't implement index only scans, other than in the special
case where the value does not matter, like a 'count(*)'. Conceptually
perhaps this could be fixed, but I don't see it happening. Since an
index-only scan is usually not much good with only a single-column index, I
don't see much excitement to improve things here. If if there is more than
one column in the index, then for GIN the entire value from both columns
would not be stored in the same index entry, so in this case it can't use
an index-only scan even conceptually to efficiently fetch the value of one
column based on the supplied value of another one.

>
> 2.
>
> Lack of index only scans is definitely a downside. However, I see
> basically identical performance, but way less memory and space usage, for
> gin indexes. In terms of read-only performance, if index only scans are
> not a factor, why not always recommend btree_gin indexes instead of regular
> btree for low cardinality fields, which will yield similar performance but
> use far, far less space and resources?
>

GIN indexes over btree_gin operators do not support inequality or BETWEEN
queries efficiently. True read-onlyness is often talked about but rarely
achieved, I would be reluctant to design a system around management
promises that something won't ever change. Btree indexes are way more
thoroughly tested than GIN. When I became interested in using them in
more-or-less the way you describe, I started torture testing on them and
quickly found some bugs (hopefully all fixed now, but I wouldn't bet my
life on it). btree_gin covers the most frequently used data types, but far
from all of them. And as described above, multi-column GIN indexes are
just entirely different from multi-column B-Tree indexes, and generally not
very useful. And in the case of very low cardinality, I think indexing
those at all will often only be useful as a component in a regular
multi-column index, where the selectivity gets multiplied in with other
columns in an efficient way.

With those caveats in mind, I would and do recommend them. But where would
such a recommendation be stored, other than email archives or blog posts?
Is there a part of the documentation you would propose amending?

>
> 3.
>
> This relates to 2. I understand the write overhead can be much greater
> for GIN indexes, which is why the fastupdate feature exists. But again, in
> those discussions in the docs, it appears to me they are emphasizing that
> penalty more for full text or json GIN indexes. Does the same overhead
> apply to a btree_gin index on a regular column with no expressions?
>

It is not the same overhead. With FTS, inserting a row (or non-HOT
updating a row) without fastupdate means you would need to change a lot of
individual index leaf pages, one for each token in the document. With
btree_gin, that particular thing should not be a problem. If you have 30
values each with 1,000,000 rows, updating the gin index might cause more
WAL traffic than a similar B-Tree index, because more of the leaf page may
need to get modified in each update as you are recompressing a large list,
but the difference is not that large, maybe a factor of 2 in my tests, and
it might depend on many factors like fastupdate setting. I do recall that
the replay of GIN WAL onto a standby was annoying slow, but I haven't
tested it with your particular set up and not in the most recent versions.

Cheers,

Jeff

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom K 2019-06-02 18:14:04 Re: psql: FATAL: the database system is starting up
Previous Message Adrian Klaver 2019-06-02 15:46:57 Re: psql: FATAL: the database system is starting up