From: Jeremy Finzel <finzelj(at)gmail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Michael Lewis <mlewis(at)entrata(dot)com>
Subject: Re: BRIN index which is much faster never chosen by planner
Date: 2019-10-15 16:05:13
Message-ID: CAMa1XUi-LicCQs8BquTiaV_Mrjk57Efroew=aKuZ9WnpQ26Rvw@mail.gmail.com
Lists: pgsql-hackers
Thank you for the thorough and thoughtful reply! Please see below.
On Mon, Oct 14, 2019 at 3:48 PM David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
wrote:
> Another thing which you might want to look at is the correlation
> column in the pg_stats view for the rec_insert_time column. Previous
> to 7e534adcd, BRIN indexes were costed based on the selectivity
> estimate, with no account taken of the fact that the pages
> for those records might have been spread out over the entire table.
> Post 7e534adcd, we use the correlation estimate to attempt to estimate
> how many pages (more specifically "ranges") we're likely to hit based
> on that and the selectivity estimate. This commit intended to fix the
> issue we had with BRIN indexes being selected far too often. Of
> course, the correlation is based on the entire table, if there are
> subsets of the table that are perhaps perfectly correlated, then the
> planner is not going to know about that. It's possible that some of
> your older rec_insert_times are spread out far more than the newer
> ones. As a test, you could try creating a new table and copying the
> records over to it in rec_insert_time order and seeing if the BRIN
> index is selected for that table (after having performed an ANALYZE).
>
> It would be interesting if you could show the pg_stats row for the
> column so that we can see if the correlation is low.
>
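For concreteness, I take the suggested test to be something like the following (a rough sketch: log_table_sorted is just a placeholder name, and pages_per_range=1000 is my guess based on the name of the existing index):

    -- Rebuild a copy of the table in rec_insert_time order
    CREATE TABLE log_table_sorted AS
        SELECT * FROM log_table ORDER BY rec_insert_time;

    -- Recreate the BRIN index on the physically ordered copy
    CREATE INDEX log_table_sorted_brin
        ON log_table_sorted USING brin (rec_insert_time)
        WITH (pages_per_range = 1000);

    -- Refresh stats so the planner sees the improved correlation
    ANALYZE log_table_sorted;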
So what I said originally (and light bulbs now going off) is that the table
is insert-only, but it is **pruned (deletes) to the past year of data**. I
think this is the key of what I've missed. Once vacuum runs, we have a
bunch of old physical space being re-used by new inserts, thus botching
that good correlation between physical and logical order. So it appears
the physical order of the data is indeed well-correlated in recent history,
but not so when you go back a bit further. Here are pg_stats:
-[ RECORD 1 ]----------+---------------------------
schemaname | foo
tablename | log_table
attname | rec_insert_time
inherited | f
null_frac | 0
avg_width | 8
n_distinct | 1.89204e+06
correlation | 0.193951
most_common_elems |
most_common_elem_freqs |
elem_count_histogram |
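(For anyone wanting to reproduce this, the above came from something like the following, with the schema/table/column names from this thread:)

    SELECT schemaname, tablename, attname, inherited, null_frac,
           avg_width, n_distinct, correlation
    FROM pg_stats
    WHERE tablename = 'log_table'
      AND attname = 'rec_insert_time';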
I have not tried creating a fresh table, but if I modify my OP query to
instead take a 10-day window from 100 days ago, the BRIN index actually
performs really badly, essentially identically to the seq scan:
Here is with a seq scan:
                                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=26822167.70..26822170.60 rows=129 width=120) (actual time=200730.856..200730.910 rows=62 loops=1)
   ->  Sort  (cost=26822167.70..26822168.02 rows=129 width=104) (actual time=200730.834..200730.837 rows=62 loops=1)
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         Sort Method: quicksort  Memory: 33kB
         ->  Subquery Scan on unique_cases  (cost=26822160.60..26822163.18 rows=129 width=104) (actual time=200730.672..200730.763 rows=62 loops=1)
               ->  Unique  (cost=26822160.60..26822161.89 rows=129 width=124) (actual time=200730.670..200730.753 rows=62 loops=1)
                     ->  Sort  (cost=26822160.60..26822160.92 rows=129 width=124) (actual time=200730.668..200730.686 rows=395 loops=1)
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           Sort Method: quicksort  Memory: 80kB
                           ->  Nested Loop  (cost=0.00..26822156.08 rows=129 width=124) (actual time=200692.321..200730.121 rows=395 loops=1)
                                 Join Filter: ((l.category)::text = filter.category)
                                 Rows Removed by Join Filter: 552210
                                 ->  Seq Scan on small_join_table filter  (cost=0.00..26.99 rows=1399 width=8) (actual time=0.013..0.179 rows=1399 loops=1)
                                 ->  Materialize  (cost=0.00..26818970.84 rows=129 width=99) (actual time=136.942..143.440 rows=395 loops=1399)
                                       ->  Seq Scan on log_table l  (cost=0.00..26818970.20 rows=129 width=99) (actual time=191581.193..200649.013 rows=395 loops=1)
                                             Filter: ((field1 IS NOT NULL) AND (category = 'music'::name) AND (rec_insert_time >= (now() - '100 days'::interval)) AND (rec_insert_time <= (now() - '90 days'::interval)))
                                             Rows Removed by Filter: 315097963
 Planning Time: 1.541 ms
 Execution Time: 200731.273 ms
(19 rows)
Here is with the forced BRIN index scan:
                                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=26674201.49..26674204.40 rows=129 width=120) (actual time=200303.118..200303.177 rows=62 loops=1)
   ->  Sort  (cost=26674201.49..26674201.82 rows=129 width=104) (actual time=200303.093..200303.096 rows=62 loops=1)
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         Sort Method: quicksort  Memory: 33kB
         ->  Subquery Scan on unique_cases  (cost=26674194.39..26674196.97 rows=129 width=104) (actual time=200302.918..200303.012 rows=62 loops=1)
               ->  Unique  (cost=26674194.39..26674195.68 rows=129 width=124) (actual time=200302.914..200302.998 rows=62 loops=1)
                     ->  Sort  (cost=26674194.39..26674194.71 rows=129 width=124) (actual time=200302.911..200302.929 rows=395 loops=1)
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           Sort Method: quicksort  Memory: 80kB
                           ->  Nested Loop  (cost=1245.13..26674189.87 rows=129 width=124) (actual time=138087.290..200301.925 rows=395 loops=1)
                                 ->  Bitmap Heap Scan on log_table l  (cost=1244.85..26673801.66 rows=129 width=99) (actual time=138087.083..200298.259 rows=395 loops=1)
                                       Recheck Cond: ((rec_insert_time >= (now() - '100 days'::interval)) AND (rec_insert_time <= (now() - '90 days'::interval)))
                                       Rows Removed by Index Recheck: 214939302
                                       Filter: ((field1 IS NOT NULL) AND (category = 'music'::name))
                                       Rows Removed by Filter: 15326889
                                       Heap Blocks: lossy=13566000
                                       ->  Bitmap Index Scan on rec_insert_time_brin_1000  (cost=0.00..1244.81 rows=78608872 width=0) (actual time=678.203..678.203 rows=135660000 loops=1)
                                             Index Cond: ((rec_insert_time >= (now() - '100 days'::interval)) AND (rec_insert_time <= (now() - '90 days'::interval)))
                                 ->  Index Only Scan using small_join_table_pkey on small_join_table filter  (cost=0.28..3.01 rows=1 width=8) (actual time=0.005..0.005 rows=1 loops=395)
                                       Index Cond: (category = (l.category)::text)
                                       Heap Fetches: 395
 Planning Time: 2.031 ms
 Execution Time: 200304.411 ms
(23 rows)
> So you have just 466 rows matching these quals, but the executor had
> to scan 1.5 million pages to get those and filter out 8.1 million rows
> on the recheck then 19.8 million on the filter. You've mentioned that
> the table's heap is 139 GB, which is about 18 million pages. It seems
> your query would perform much better if you had a btree index such as
> (category, rec_insert_time) where field1 is not null;,
>
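For the record, I read that suggestion as a partial btree index along these lines (index name is just illustrative):

    CREATE INDEX log_table_category_rec_insert_time_idx
        ON log_table (category, rec_insert_time)
        WHERE field1 IS NOT NULL;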
I agree a btree would do the trick for performance, but I was trying to
avoid the nearly 100 GB overhead of such an index. For the given example,
the somewhat improved performance of BRIN may be acceptable to me.
However, as seen above, it appears that a btree may be my only option for
getting reliable performance with this workload.
> Of course, you've mentioned that you are finding when the plan uses
> the BRIN index that it executes more quickly, but I think you're going
> to find BRIN unreliable for tables anything other than INSERT-only
> tables which the records are always inserted with an ever-increasing
> or decreasing value in the BRIN indexed column. If you start
> performing UPDATEs then that's going to create holes that new records
> will fill and cause the correlation to start dropping, resulting in the
> BRIN index scan cost going up.
>
Or deletes, as in my case.
> On the other hand, if you think you can do better than what was done
> in 7e534adcd, then it would be good to see someone working on it. I'm
> sure something better can be done. It's just not that easy to do with
> the scant correlation data we have on the column.
>
> As for is this a bug or something that's missing from the documents.
> The documents do mention:
>
> "BRIN stands for Block Range Index. BRIN is designed for handling very
> large tables in which certain columns have some natural correlation
> with their physical location within the table."
>
Yes, I was aware of this, and perhaps nothing indeed needs to change with
docs here given my case. But perhaps it would be worth exploring if there
could be more detailed stats on physical vs logical correlation, such as
when ANALYZE takes its samples, noting physical locations as well as
logical values, and allowing the correlation to take account of that more
detailed analysis. Of course, that sounds like a huge amount of work with
uncertain benefits. In my case, it could be said that if I am always
querying the last few days of data, a BRIN index here is perfect, and a
BTREE is way overkill. That is a real use case to consider. But more
generally, I would drop the BRIN index if I had any other query patterns
beyond the few days of data.
However, this may be a fallacy. It might be that a few days from now, the
last 10 days of data will actually be really fragmented, depending only on
when VACUUM runs.
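One way I could sanity-check how much reusable space VACUUM has opened up is the pg_freespacemap contrib module (a sketch only; it requires installing the extension, and reading the free space map of a table this size is not free):

    CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

    -- Count heap pages with any recorded free space that
    -- new inserts could land in, breaking physical ordering
    SELECT count(*) AS pages_with_free_space
    FROM pg_freespace('log_table')
    WHERE avail > 0;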
As the docs state, I do believe that the only use case that will work
really well for BRIN is either a truly insert-only table which is never
pruned (an idea I dislike as a DBA who wants us to keep OLTP data trim and
implement data retention policies :), or a table which is routinely
CLUSTERed!
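(Noting for completeness: CLUSTER cannot use a BRIN index, so keeping the table clustered would itself require a btree anyway, e.g. something like the following, with an illustrative index name:)

    -- CLUSTER needs a btree (or other clusterable) index; BRIN does not qualify
    CREATE INDEX log_table_rec_insert_time_btree
        ON log_table (rec_insert_time);
    CLUSTER log_table USING log_table_rec_insert_time_btree;
    ANALYZE log_table;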
Thanks again for the detailed feedback.
Thanks,
Jeremy