Re: Questions about btree_gin vs btree_gist for low cardinality columns

From: Morris de Oryx <morrisdeoryx(at)gmail(dot)com>
To: "Peter J(dot) Holzer" <hjp-pgsql(at)hjp(dot)at>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Questions about btree_gin vs btree_gist for low cardinality columns
Date: 2019-06-01 23:10:25
Message-ID: CAKqncch73a0rff1sUb17eFCivLMV+sWhzxc2Bx2o_q1p3H4jGA@mail.gmail.com
Lists: pgsql-general

Peter, thanks a lot for picking up on what I started, improving it, and
reporting back. I *thought* I was providing timing estimates from the
EXPLAIN cost dumps. Seems not. Well, there's another thing I've learned.

Your explanation of why the hash index bloats out makes complete sense; I
ought to have thought of that.

Can you tell me how you get timing results into state_test_times? I know
how to turn on timing display in psql, but I much prefer straight SQL. The
reason is that my production code always runs through a SQL session, not
by typing things into psql.
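While waiting on Peter's answer, here is a sketch of one way this could be done in plain SQL (this is an assumption, not necessarily his method): wrap the query in a DO block and record wall-clock time with clock_timestamp(), which keeps advancing within a transaction (unlike now(), which is frozen at transaction start). Table and column names (state_test, abbr, num, state_test_times) are taken from the thread; the time_ms column is inferred from the summary query.

```sql
-- Hypothetical sketch: capture query timing in pure SQL,
-- no psql \timing required.
CREATE TABLE IF NOT EXISTS state_test_times (
    method  text,
    time_ms numeric
);

DO $$
DECLARE
    t0    timestamptz;
    t1    timestamptz;
    dummy bigint;
BEGIN
    t0 := clock_timestamp();              -- advances during the txn
    SELECT sum(num) INTO dummy
      FROM state_test
     WHERE abbr = 'MA';
    t1 := clock_timestamp();
    INSERT INTO state_test_times (method, time_ms)
    VALUES ('B tree', extract(epoch FROM (t1 - t0)) * 1000);
END
$$;
```

Run once per index variant (changing the 'method' label), and the summary query from below aggregates the rows. Note this measures total round-trip inside the server, including the heap scan, not just index access.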

On Sat, Jun 1, 2019 at 11:53 PM Peter J. Holzer <hjp-pgsql(at)hjp(dot)at> wrote:

> On 2019-06-01 17:44:00 +1000, Morris de Oryx wrote:
> > Since I've been wondering about this subject, I figured I'd take a bit
> of time
> > and try to do some tests. I'm not new to databases or coding, but have
> been
> > using Postgres for less than two years. I haven't tried to generate large
> > blocks of test data directly in Postgres before, so I'm sure that there
> are
> > better ways to do what I've done here. No worries, this gave me a chance
> to
> > work through at least some of the questions/problems in setting up and
> running
> > tests.
> >
> > Anyway, I populated a table with 1M rows of data with very little in
> them, just
> > a two-character state abbreviation. There are only 59 values, and the
> > distribution is fairly even as I used random() without any tricks to
> shape the
> > distribution. So, each value is roughly 1/60th of the total row count.
> Not
> > realistic, but what I've got.
> >
> > For this table, I built four different kind of index and tried each one
> out
> > with a count(*) query on a single exact match. I also checked out the
> size of
> > each index.
> >
> > Headline results:
> >
> > Partial index: Smaller (as expected), fast.
> > B-tree index: Big, fast.
> > GIN: Small, slow.
> > Hash: Large, slow. ("Large" may be exaggerated in comparison with a
> B-tree
> > because of my test data.)
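For reference, the four index variants compared above might be created like this (a sketch; the index names and the partial-index predicate are assumptions, since the original DDL wasn't posted — note that GIN on a plain scalar column requires the btree_gin extension, the subject of this thread):

```sql
-- Hypothetical reconstruction of the four test indexes.
CREATE EXTENSION IF NOT EXISTS btree_gin;

CREATE INDEX state_test_abbr_btree   ON state_test (abbr);
CREATE INDEX state_test_abbr_partial ON state_test (abbr)
    WHERE abbr = 'MA';                       -- covers one value only
CREATE INDEX state_test_abbr_gin     ON state_test USING gin  (abbr);
CREATE INDEX state_test_abbr_hash    ON state_test USING hash (abbr);
```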
>
> You didn't post any times (or the queries you timed), so I don't know
> what "fast" and "slow" mean.
>
> I used your setup to time
> select sum(num) from state_test where abbr = 'MA';
> on my laptop (i5-7Y54, 16GB RAM, SSD, Pgsql 10.8) and here are the
> results:
>
> hjp=> select method, count(*),
> min(time_ms),
> avg(time_ms),
> percentile_cont(0.5) within group (order by time_ms) as median,
> max(time_ms)
> from state_test_times
> group by method
> order by 5;
>
> method | count | min | avg | median | max
> ---------+-------+--------+---------+--------+--------
> Partial | 5 | 9.05 | 9.7724 | 9.185 | 12.151
> B tree | 5 | 9.971 | 12.8036 | 10.226 | 21.6
> GIN | 5 | 9.542 | 10.3644 | 10.536 | 10.815
> Hash | 5 | 10.801 | 11.7448 | 11.047 | 14.875
>
> All the times are pretty much the same. GIN is third by median, but the
> difference is tiny, and it is second by minimum and average, and even
> first by maximum.
>
> In this case all the queries do a bitmap scan, so the times are probably
> dominated by reading the heap, not the index.
>
> > method   pg_table_size  size
> > Partial        401408   392 Kb
> > B tree       22487040   21960 Kb
> > GIN           1916928   1872 Kb
> > Hash         49250304   48096 Kb
>
> I get the same sizes.
>
>
> > Okay, so the partial index is smaller, basically proportional to the
> fraction
> > of the file it's indexing. So that makes sense, and is good to know.
>
> Yeah. But to cover all values you would need 59 partial indexes, which
> gets you back to the size of the full btree index. My test shows that it
> might be faster, though, which might make the hassle of having to
> maintain a large number of indexes worthwhile.
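If one did want a partial index per value, the 59 CREATE INDEX statements could be generated rather than hand-written. A sketch (names are hypothetical; this is not something either poster actually ran):

```sql
-- Hypothetical: one partial index per distinct state abbreviation.
DO $$
DECLARE
    st text;
BEGIN
    FOR st IN SELECT DISTINCT abbr FROM state_test LOOP
        EXECUTE format(
            'CREATE INDEX IF NOT EXISTS state_test_abbr_%s
                 ON state_test (abbr) WHERE abbr = %L',
            lower(st), st);
    END LOOP;
END
$$;
```

The planner picks the matching partial index only when the query's WHERE clause provably implies the index predicate, so this works for literal equality comparisons but not for parameterized plans in general.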
>
> > The hash index size is...harder to explain...very big. Maybe my tiny
> > strings? Not sure what size Postgres hashes to. A hash of a two
> > character string is likely about worst-case.
>
> I think that a hash index is generally a poor fit for low cardinality
> indexes: You get a lot of equal values, which are basically hash
> collisions. Unless the index is specifically designed to handle this
> (e.g. by storing the key only once and then a tuple list per key, like a
> GIN index does) it will balloon out trying to reduce the number of
> collisions.
>
> hp
>
> --
> _ | Peter J. Holzer | we build much bigger, better disasters now
> |_|_) | | because we have much more sophisticated
> | | | hjp(at)hjp(dot)at | management tools.
> __/ | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
>
