Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From: Oleg Ivanov <o(dot)ivanov(at)postgrespro(dot)ru>
To: obartunov(at)gmail(dot)com, Nikolay Samokhvalov <samokhvalov(at)gmail(dot)com>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
Date: 2017-12-12 17:02:04
Message-ID: 5A300B8C.9020000@postgrespro.ru
Lists: pgsql-hackers

On 12/12/2017 04:33 PM, Oleg Bartunov wrote:
> On Mon, Dec 11, 2017 at 11:11 PM, Nikolay Samokhvalov
> <samokhvalov(at)gmail(dot)com> wrote:
>> Very interesting read: https://arxiv.org/abs/1712.01208
>>
>> HN discussion: https://news.ycombinator.com/item?id=15894896
>>
>> Some of the comments (from Twitter
>> https://twitter.com/schrockn/status/940037656494317568): "Jeff Dean and co
>> at GOOG just released a paper showing how machine-learned indexes can
>> replace B-Trees, Hash Indexes, and Bloom Filters. Execute 3x faster than
>> B-Trees, 10-100x less space. Executes on GPU, which are getting faster
>> unlike CPU. Amazing."
>>
>> Can those ideas be applied to Postgres in its current state? Or it's not
>> really down-to-earth?
> Oleg made some analysis of the paper.
If the keys are comparable, the data distribution is simple enough
(which seems to happen rather often), and the data does not change, we
can learn the distribution and thus perform the search faster while
also saving memory. That is what the authors write, and in my opinion
that is what should work.
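
For the curious, here is a minimal sketch of that idea in Python (my
own illustration, not code from the paper; the names LearnedIndex and
lookup are invented, and a real implementation would use the neural
models the authors describe rather than a single linear fit):

import bisect

class LearnedIndex:
    """Fit a trivial linear model of key -> position over a sorted
    array, record the worst-case prediction error, and restrict the
    binary search to that error window."""

    def __init__(self, keys):
        self.keys = keys                    # sorted numeric keys
        n = len(keys)
        mean_k = sum(keys) / n
        mean_p = (n - 1) / 2.0
        var = sum((k - mean_k) ** 2 for k in keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
        # least-squares line: a stand-in for the models in the paper
        self.slope = cov / var if var else 0.0
        self.bias = mean_p - self.slope * mean_k
        # the maximal error bounds how far a lookup has to search
        self.max_err = max(abs(self._predict(k) - i)
                           for i, k in enumerate(keys))

    def _predict(self, key):
        return self.slope * key + self.bias

    def lookup(self, key):
        pos = self._predict(key)
        lo = max(0, int(pos - self.max_err))
        hi = min(len(self.keys), int(pos + self.max_err) + 2)
        # binary search only inside the window the model guarantees
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None

The better the model fits the distribution, the smaller max_err gets,
and the lookup degrades gracefully to a plain binary search when the
fit is poor.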

The limitations of the approach, which the authors mention in their work:
+ The data does not change. The proposed method can be extended more or
less straightforwardly to nearly append-only workloads and to workloads
in which the data distribution does not change or changes very slowly
(the data itself or its amount may still change, nevertheless).
Otherwise it is challenging, because the key positions can no longer be
treated as independent values, but it still looks possible. The other
option proposed by the authors is to store new objects in a buffer and
retrain the model in the background (see the sketch after this list),
but that does not look like an elegant solution.
+ GPUs are fast, but the latency of invoking operations on the GPU
almost certainly cancels out all speed benefits for such small models.
The only case in which it is reasonable to use a GPU is training the
models (i.e. building such indexes).
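
To illustrate the buffer option mentioned above, a toy single-threaded
sketch (mine, reusing LearnedIndex from the previous sketch; in the
paper the retraining would run in the background, which I do not model
here, and the class name is invented):

import bisect

class BufferedLearnedIndex:
    """New keys go into a small sorted delta buffer; lookups consult
    both the model and the buffer; a merge plus full retrain happens
    once the buffer fills up."""

    def __init__(self, keys, buffer_limit=1024):
        self.index = LearnedIndex(sorted(keys))
        self.buffer = []                    # sorted delta buffer
        self.buffer_limit = buffer_limit

    def insert(self, key):
        bisect.insort(self.buffer, key)
        if len(self.buffer) >= self.buffer_limit:
            # in the paper: retrain in the background, not inline
            merged = sorted(self.index.keys + self.buffer)
            self.index = LearnedIndex(merged)
            self.buffer = []

    def contains(self, key):
        if self.index.lookup(key) is not None:
            return True
        i = bisect.bisect_left(self.buffer, key)
        return i < len(self.buffer) and self.buffer[i] == key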

The following remained unclear to me from the paper:
+ How effectively the approach can be implemented, taking the
limitations above into account.
+ For hash indexes, the authors propose to replace the hash function
with the learned CDF, which can decrease the number of collisions and
therefore the memory consumption (see the first sketch after this
list). That is reasonable, but such a hash function implicitly uses
information about the order on the keys. I suppose that, using the
ordering on the data and with access to the data histogram, one could
achieve a similar decrease in memory consumption.
+ The proposed hierarchical learning (see the second sketch after this
list) looks natural for the problem, but it is not clear that it is
the best option. For example, one might use an end-to-end learning
procedure with weight sparsity regularizers instead. I wonder whether
the latter approach can obtain results competitive with the former,
and if not, why. Anyway, I have a feeling that the proposed method has
room for improvement.
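
What I understand the CDF-as-hash construction to be, in code (my
paraphrase; cdf_model stands for whatever learned model approximates
the CDF, and the function name is mine):

def cdf_hash(key, cdf_model, num_buckets):
    # cdf_model(key) approximates P(K <= key), i.e. the CDF in [0, 1];
    # scaling it by the bucket count spreads keys nearly uniformly
    # over the buckets, hence fewer collisions on skewed data than a
    # distribution-oblivious hash function would produce.
    p = min(max(cdf_model(key), 0.0), 1.0)
    return min(int(p * num_buckets), num_buckets - 1)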
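
And by the hierarchical scheme I mean, in toy two-stage form (again my
own sketch, reusing LearnedIndex from above; the paper's recursive
model index trains each leaf on the keys the root actually routes to
it, which I simplify to fixed slices here):

class TwoStageRMI:
    """A root model routes each key to one of several leaf models,
    each fitted only to its own contiguous slice of the sorted keys."""

    def __init__(self, keys, fanout=16):
        n = len(keys)
        self.step = max(1, -(-n // fanout))      # ceil(n / fanout)
        self.root = LearnedIndex(keys)           # stage-1 model
        self.offsets = list(range(0, n, self.step))
        self.leaves = [LearnedIndex(keys[lo:lo + self.step])
                       for lo in self.offsets]   # stage-2 models

    def lookup(self, key):
        pos = int(self.root._predict(key))
        leaf = min(max(pos // self.step, 0), len(self.leaves) - 1)
        # the root's error can route a key just across a slice
        # boundary; probing the neighbours on a miss keeps lookups
        # correct as long as that error stays within one slice
        for j in (leaf, leaf - 1, leaf + 1):
            if 0 <= j < len(self.leaves):
                i = self.leaves[j].lookup(key)
                if i is not None:
                    return self.offsets[j] + i
        return None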

In general, the approach still needs some additional research (mostly
regarding changing data), and there are some points to think through
regarding a proper implementation (paging, for example), but it looks
promising enough nevertheless.
