Luca Ferrari wrote:
> Hello,
> I've got a doubt about partial indexes and the path chosen by the optimizer.
> Consider this simple scenario:
>
> CREATE TABLE p( pk serial NOT NULL , val2 text, val1 text, b boolean, PRIMARY
> KEY (pk) );
> INSERT INTO p(pk, val1, val2, b) VALUES( generate_series(1,1000000), 'val1b',
> 'val2b', true );
> INSERT INTO p(pk, val1, val2, b) VALUES( generate_series(1000001,2000000),
> 'val1Notb', 'val2Notb', false );
> CREATE INDEX i_p_b ON p (b) WHERE b = true;
> ANALYZE p;
>
> So I create a table with 2 million rows, the first million with b = true and
> the second million with b = false.
> Now, running an EXPLAIN for a query that filters only on the b attribute, I get:
>
> EXPLAIN SELECT * FROM p WHERE b = false;
> QUERY PLAN
> ------------------------------------------------------------
> Seq Scan on p (cost=0.00..34706.00 rows=1000133 width=28)
> Filter: (NOT b)
>
>
> So a sequential scan. I know that the optimizer will not consider an index if
> it does not filter out enough rows, but I don't understand exactly why that is
> the case here. In fact, considering that the above query could skip the first
> half of the data pages (where b = true), and considering that:
>
> SELECT reltype, relname, relpages, reltuples
> FROM pg_class WHERE relname IN ('p', 'i_p_b');
>  reltype | relname | relpages | reltuples
> ---------+---------+----------+-----------
>   615079 | p       |    14706 |     2e+06
>        0 | i_p_b   |     2745 |    999867
>
> The sequential scan has to read 14706 pages, while using the index to skip
> almost half of them would read about 2745 + 7353 = around 10000 pages.
> I've also tried changing the index type to a hash, but the situation did not
> change. Even with enable_seqscan = off the above query is still executed
> sequentially, but with a different start-up cost:
>
>
> EXPLAIN SELECT * FROM p WHERE b = false;
> QUERY PLAN
> ----------------------------------------------------------------------------
> Seq Scan on p (cost=10000000000.00..10000034706.00 rows=1000133 width=28)
> Filter: (NOT b)
>
>
> And here comes the second doubt: since the planner is doing a sequential scan
> in both cases, why does the first case have a start-up cost of 0 while this
> one has a cost of 10 billion?
> I'm getting lost here; I need some hint to understand what is happening.
>
> I'm running
> PostgreSQL 9.0.2 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2
> 20080704 (Red Hat 4.1.2-46), 64-bit
>
>
> Thanks,
> Luca
>
>
I don't see where you've told the engine anything useful about the "b" column.
Naturally a boolean can only have two values, but your partial index only
covers the b = true rows, and you haven't partitioned the table on the values
in the "b" column. I don't think the planner can just _know_ that the first n
records have b = true.
What behaviour would you expect if you had alternated the inserts (or just
alternated the primary keys by assigning them even and odd values in the
inserts)?
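
For what it's worth, here is a minimal, untested sketch of what I would try,
reusing the table and column names from your example (the index name
i_p_not_b is just something I made up):

-- Add an index that actually covers the rows the query asks for:
-- a complementary partial index (or a plain index on b would also do).
CREATE INDEX i_p_not_b ON p (b) WHERE b = false;
ANALYZE p;

-- Compare the plans with and without sequential scans allowed:
EXPLAIN SELECT * FROM p WHERE b = false;
SET enable_seqscan = off;
EXPLAIN SELECT * FROM p WHERE b = false;
RESET enable_seqscan;

Even then I would not be surprised if the planner still prefers the sequential
scan: roughly half the table matches b = false, and fetching a million heap
rows through an index is usually costlier than reading all 14706 pages in
order. As for your second doubt, if I remember correctly enable_seqscan = off
does not remove the sequential-scan path; the planner just adds a huge
constant (1e10) to its cost, which is the 10000000000.00 start-up cost you
saw in the second plan.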