Re: Limit Heap Fetches / Rows Removed by Filter in Index Scans

From: Sameer Kumar <sameer(dot)kumar(at)ashnik(dot)com>
To: Victor Blomqvist <vb(at)viblo(dot)se>
Cc: PostgreSQL General Discussion Forum <pgsql-general(at)postgresql(dot)org>
Subject: Re: Limit Heap Fetches / Rows Removed by Filter in Index Scans
Date: 2016-08-19 09:16:42
Message-ID: CADp-Sm7GSO_=7iOwarX9-ScwA9GOoeiCMZxcztyY7_qqvUphWw@mail.gmail.com
Lists: pgsql-general

On Fri, Aug 19, 2016 at 2:25 PM Victor Blomqvist <vb(at)viblo(dot)se> wrote:

> On Fri, Aug 19, 2016 at 1:31 PM, Sameer Kumar <sameer(dot)kumar(at)ashnik(dot)com>
> wrote:
>
>>
>>
>> On Fri, 19 Aug 2016, 1:07 p.m. Victor Blomqvist, <vb(at)viblo(dot)se> wrote:
>>
>>> Hi,
>>>
>>> Is it possible to break/limit a query so that it returns whatever
>>> results it has found after checking X rows in an index scan?
>>>
>>> For example:
>>> create table a(id int primary key);
>>> insert into a select * from generate_series(1,100000);
>>>
>>> select * from a
>>> where id%2 = 0
>>> order by id limit 10
>>>
>>> In this case the query will "visit" 20 rows and filter out 10 of them.
>>> We can see that in the query plan:
>>> "Rows Removed by Filter: 10"
>>> "Heap Fetches: 20"
>>>
>>> Is it somehow possible to limit this query so that it only fetches X
>>> rows? In my example, if we limited it to 10 heap fetches, the query
>>> would return the first 5 rows.
>>>
>>> My use case is a table with 35 million rows with a geo index, where
>>> I want to do a KNN search but also limit the query on some other
>>> parameters. In some cases the other parameters restrict the query so much
>>> that the heap fetches grow to several hundred thousand or more, and in
>>> those cases I would like to put a limit on my query.
>>>
>>
>> Have you checked the TABLESAMPLE clause in v9.5?
>>
>> https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation
>>
>>
> Unless I misunderstand what you mean or how it works, I can't really see
> how it would help.
>
>
I stand corrected. TABLESAMPLE will not help you.
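
For the record, what I had in mind would have looked roughly like this (a
sketch against the example table from earlier in this thread; BERNOULLI
samples a given percentage of rows at random, which is exactly why it cannot
respect your ORDER BY):

```sql
-- hypothetical sketch: sample roughly 1% of the table at random,
-- then filter and order only the sampled rows (PostgreSQL 9.5+)
select * from a tablesample bernoulli (1)
where id % 2 = 0
order by id
limit 10;
```

Since the sample is taken before the ordering, the "best" rows by id are
usually not in it.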

> I want my query to still return the "best" results, and I want it to use
> the index for that. Randomly sampling from the whole table would either
> cover too small a subset of the rows or be too slow.
>
> So, given my query above, in the normal ("slow") case I would find the
> first 10 even rows:
> 2,4,6,8,10,12,14,16,18,20
> If I could restrict the heap fetches to 10 I would find
> 2,4,6,8,10
> However, with TABLESAMPLE I might end up with, for example, these rows:
> 15024,71914,51682,7110,61802,63390,98278,8022,34256,49220
>
>
How about using LIMIT?
SELECT column_1, column_2, ... FROM my_table WHERE <<expression>>
ORDER BY my_column
LIMIT 10;
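
If LIMIT alone is not enough because the filter discards too many rows
before the limit is reached, one way to put a hard bound on the work per
round trip (a sketch, not tested against your schema) is a cursor over the
plain index-ordered scan, fetching in small batches and applying the
selective filter on the client side, so each FETCH touches a bounded number
of rows:

```sql
begin;
-- cursor over the index-ordered scan, without the selective filter
declare c cursor for select * from a order by id;
-- each FETCH visits at most 10 rows; apply the id % 2 = 0 filter
-- in the application, and FETCH again until you have enough results
fetch 10 from c;
close c;
commit;
```

Each batch is bounded, at the cost of extra round trips when the filter is
very selective.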

> In my use case I want the best rows (according to the order by), so just
> a random sample is not good enough.
>
> /Victor
>
>
>>
>>> Thanks!
>>> /Victor
>>>
--
Best Regards
Sameer Kumar | DB Solution Architect
*ASHNIK PTE. LTD.*

101 Cecil Street, #11-11 Tong Eng Building, Singapore 069 533

T: +65 6438 3504 | M: +65 8110 0350

Skype: sameer.ashnik | www.ashnik.com
