Re: Questions about btree_gin vs btree_gist for low cardinality columns

From: Morris de Oryx <morrisdeoryx(at)gmail(dot)com>
To: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Questions about btree_gin vs btree_gist for low cardinality columns
Date: 2019-06-01 07:44:00
Message-ID: CAKqncciUyxY04GagOWYALE7+zOeF9a+AD=gYWV8PvJROw_YN7Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Since I've been wondering about this subject, I figured I'd take a bit of
time and try to do some tests. I'm not new to databases or coding, but have
been using Postgres for less than two years. I haven't tried to generate
large blocks of test data directly in Postgres before, so I'm *sure* that
there are better ways to do what I've done here. No worries, this gave me a
chance to work through at least some of the questions/problems in setting
up and running tests.

Anyway, I populated a table with 1M rows of data with very little in them,
just a two-character state abbreviation. There are only 59 values, and the
distribution is fairly even as I used random() without any tricks to shape
the distribution. So, each value is roughly 1/60th of the total row count.
Not realistic, but what I've got.

For this table, I built four different kind of index and tried each one out
with a count(*) query on a single exact match. I also checked out the size
of each index.

Headline results:

Partial index: Smaller (as expeced), fast.
B-tree index: Big, fast.
GIN: Small, slow.
Hash: Large, slow. ("Large" may be exaggerated in comparison with a B-tree
because of my test data.)

I'm wondering how much impact it had that I used such very small strings,
and how much difference it made that the data was so evenly distributed.

If anyone has any insights, I'd be grateful to hear them. I'm posting the
various bits of code involved below for anyone following along at home.

First, my version string as that can make a difference (we deploy on RDS, I
develop on macOS):

PostgreSQL 11.3 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM
version 8.1.0 (clang-802.0.42), 64-bit

There's a simple lookup table with the 59 abbreviations
(AL,AR,AS,AZ,CA,CO,CT,DC,DE,FL,FM,GA,GU,HI,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MH,MI,MN,MO,MP,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,PR,PW,RI,SC,SD,TN,TX,UT,VA,VI,VT,WA,WI,WV,WY)

BEGIN;
CREATE TABLE IF NOT EXISTS api.state (
abbr text
);
COMMIT;

And here's the test data table definition:

BEGIN;
CREATE TABLE IF NOT EXISTS ascendco.state_test (
abbr text,
num integer -- I didn't end up using this.
);
COMMIT;

I wanted to create 1M rows and bashed around with generate_series,
recursive CTEs...and didn't get it working. So I wrote a tiny function
instead:

/* random() produces are pretty consistent distribution of numbers.
For ideas on generating other distributions, this piece looks good:
https://chartio.com/learn/sql/random-sequences/
*/

CREATE OR REPLACE FUNCTION api.generate_state_test_rows (loop_max int)
RETURNS int AS $$

DECLARE
counter integer := 0;

BEGIN
IF (loop_max < 1) THEN
RETURN 0 ;
END IF;

WHILE counter <= loop_max LOOP
counter := counter + 1 ;

insert into state_test (num,abbr)

values (
random() * 1000000, -- Get a random number between
0-1,000,000.
(select abbr -- Get a random state abbreviation out of our
tiny related table.
from state
order by random()
limit 1)
);

END LOOP ;

RETURN 1;
END;

$$ LANGUAGE plpgsql

The horror. The horror. But it works:

select * from generate_state_test_rows(1000000);

Okay, so that's the data set up. Next, the indexes and their sizes:

DROP INDEX IF EXISTS abbr_partial_ma;
DROP INDEX IF EXISTS abbr_btree;
DROP INDEX IF EXISTS abbr_gin;
DROP INDEX IF EXISTS abbr_hash;

CREATE INDEX abbr_partial_ma ON state_test(abbr) WHERE abbr = 'MA';
CREATE INDEX abbr_btree ON state_test USING btree (abbr);
CREATE INDEX abbr_gin ON state_test USING gin (abbr);
CREATE INDEX abbr_hash ON state_test USING hash (abbr);

select 'Partial' as method, pg_table_size('abbr_partial_ma'),
pg_table_size('abbr_partial_ma') / 1024 || ' Kb' as "kb" union all
select 'B tree' as method, pg_table_size('abbr_btree'),
pg_table_size('abbr_btree') / 1024 || ' Kb' as "kb" union all
select 'GIN' as method, pg_table_size('abbr_gin'),
pg_table_size('abbr_gin') / 1024 || ' Kb' as "kb" union all
select 'Hash' as method, pg_table_size('abbr_hash'),
pg_table_size('abbr_hash') / 1024 || ' Kb' as "kb"

method pg_table_size kb
Partial 401408 392 Kb
B tree 22487040 21960 Kb
GIN 1916928 1872 Kb
Hash 49250304 48096 Kb

Okay, so the partial index is smaller, basically proportional to the
fraction of the file it's indexing. So that makes sense, and is good to
know. The hash index size is...harder to explain...very big. Maybe my tiny
strings? Not sure what size Postgres hashes to. A hash of a two character
string is likely about worst-case. The B-tree is pretty big.

Okay, timing tests. I was hoping that the GIN would do well, but it didn't.
Here are some explain dumps.

B-tree (partial, just on this one value)
Aggregate (cost=389.43..389.44 rows=1 width=8)
-> Index Only Scan using abbr_btree on state_test (cost=0.42..346.68
rows=17100 width=0)
Index Cond: (abbr = 'MA'::text)

B-tree on whole table
Aggregate (cost=389.43..389.44 rows=1 width=8)
-> Index Only Scan using abbr_btree on state_test (cost=0.42..346.68
rows=17100 width=0)
Index Cond: (abbr = 'MA'::text)

GIN (btree_gin, hopefully - I created the extension)
Aggregate (cost=4867.03..4867.04 rows=1 width=8)
-> Bitmap Heap Scan on state_test (cost=140.53..4824.28 rows=17100
width=0)
Recheck Cond: (abbr = 'MA'::text)
-> Bitmap Index Scan on abbr_gin (cost=0.00..136.25 rows=17100
width=0)
Index Cond: (abbr = 'MA'::text)

Hash index
Aggregate (cost=4915.00..4915.01 rows=1 width=8)
-> Index Scan using abbr_hash on state_test (cost=0.00..4872.25
rows=17100 width=0)
Index Cond: (abbr = 'MA'::text)

No index
Finalize Aggregate (cost=10696.37..10696.38 rows=1 width=8)
-> Gather (cost=10696.15..10696.36 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=9696.15..9696.16 rows=1 width=8)
-> Parallel Seq Scan on state_test (cost=0.00..9678.34
rows=7125 width=0)
Filter: (abbr = 'MA'::text)

On Sat, Jun 1, 2019 at 12:52 PM Morris de Oryx <morrisdeoryx(at)gmail(dot)com>
wrote:

> Jeremy's question is *great*, and really well presented. I can't answer
> his questions, but I am keenly interested in this subject as well. The
> links he provides lead to some really interesting and well-though-out
> pieces, well worth reading.
>
> I'm going to try restating things in my own way in hopes of getting some
> good feedback and a basic question:
>
> *What are the best ways to index low cardinality values in Postgres?*
>
> For an example, imagine an address table with 100M US street addresses
> with two character state abbreviations. So, say there are around 60 values
> in there (the USPS is the mail system for a variety of US territories,
> possessions and friends in the Pacific.) Okay, so what's the best index
> type for state abbreviation? For the sake of argument, assume a normal
> distribution so something like FM (Federated States of Micronesia) is on a
> tail end and CA or NY are a whole lot more common.
>
> A *B-tree* is obviously a pretty *bad match* for this sort of situation.
> It works, but B-trees are ideal for *unique* values, and super large for
> repeated values. Not getting into the details or Postgres specifics of
> various kinds of traditional B-trees. (I think B*?) Doesn't matter. You
> have a huge index because the index size is closely related to the number
> of *rows*, not the number of *distinct values*.
>
> Alternatively, you could set up *partial indexes* for the distinct
> values, like so:
>
> Running 10 Million PostgreSQL Indexes In Production (And Counting)
>
> https://heap.io/blog/engineering/running-10-million-postgresql-indexes-in-production
>
> Like Jeremy, I've wondered about *GIN indexes* for low-cardinality
> columns. Has anyone tried this out in PG 10 or 11? It sounds like a good
> idea. As I understand it, GIN indexes are something like a B-tree of unique
> values that link to another data structure, like a tree, bitmap, etc. So,
> in my imaginary example, there are 60 nodes for the state codes [internally
> there would be more for padding free nodes, evenly sized pages, etc....just
> pretend there are 60] and then linked to that, 60 data structures with the
> actual row references. Somehow.
>
> It can imagine things going quite well with a GIN or btree_gin. I can also
> imagine that the secondary data structure could get bloaty, slow, and
> horrible. I've worked with a system which uses bitmaps as the secondary
> structure (at least some of the time), and it can work quite well...not
> sure how it's implemented in Postgres.
>
> So, does anyone have any links or results about efficiently (space and/or
> time) indexing Boolean and other low-cardinality columns in Postgres? I'm
> on PG 11, but figure many are on 9.x or 10.x.
>
> References and results much appreciated.
>
> Thanks!
>
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Gavin Flower 2019-06-01 08:24:00 Re: Questions about btree_gin vs btree_gist for low cardinality columns
Previous Message Tom K 2019-06-01 02:53:35 Re: psql: FATAL: the database system is starting up