Dear PostgreSQL Community,
I have a rather large database with ~250 billion records in a partitioned table. The database has been running and ingesting data continuously for about 3 years.
I had a "regular" BTree index on one of the fields (a unique bigint column) but it was getting too big for the disk it was on. The index was consuming 6.4 TB of disk space.
I created a new disk with double the size, dropped the original index, and started to generate a new index.
After doing some research I decided to try to create a hash index instead of a BTree. For my purposes, the index is only used to find specific numbers ("=" and "IN" queries). From what I read, the hash index should run a little faster than btree for my use case and should use less disk space.
After 115 hours, the hash index is still generating and is using significantly more disk space than the original BTree index (8.4 TB vs 6.4 TB). I don't know how to check the status of the index creation task so I can't really estimate how much longer it will take or how much disk space it will consume.
Questions:
(1) Why is the hash index consuming more disk space than the btree index? Is it because the hash of the bigint values larger than the storing the bigints in the btree?
(2) Are there any known issues having a hash index on this many records?
(3) Are there any known issues having a hash index on partitioned tables?
(4) Is there any way to estimate when the index process will complete?
Server info:
- PostgreSQL 13
- Ubuntu 20.04.6 LTS
- 64 cores (only 1 is ever used during index creation)
- Memory usage is steady at 58GB/188GB
- All disks are high speed NVMe drives
- 1,686 tables in the partition
Thanks in advance,
Peter