query plan not optimal

From: Marc Cousin <cousinmarc(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: query plan not optimal
Date: 2013-12-19 15:33:24
Message-ID: 52B311C4.1070108@gmail.com
Lists: pgsql-performance

Hi,

I'm running into what I feel is a bit of a limitation of the optimizer (or something I don't understand :) ).

Sorry, this is a rather long mail.

I have a workaround for the problem below, but I don't like cheating the optimizer for no good reason.

First a little bit of context, because the example below may feel dumb if you read it with no explanation:

I'm working on the bacula project, trying to further optimize data insertion into the database (we've been working on this subject for a few years now, as backup speed is a core functionality of any backup software).

Bacula is backup software: we store all of the backup's metadata in the database, which means all backed up files…

The two very large tables are file and path:

* file contains metadata about ALL files that have been backed up and are still available in the database. It sometimes contains several billion records.

* path contains the full path of a file. It is stored separately from file to save (lots of) space, as most paths are shared by many files. It still usually contains 20 to 50 million records, and can be as big as 20GB.

So we have something like this:

create table file (fileid bigserial, filename text, pathid bigint);
alter table file add primary key (fileid);

create table path (pathid bigserial, path text);
alter table path add primary key (pathid);
create unique index idx_path on path(path);

I removed some columns from the file table: we store stat(), a checksum, etc. They're not needed for this example.

In order to insert data efficiently, backups first store data into temp tables which look like this:

create table batch (path text, filename text);

(once again I removed all metadata columns)

If you want to create a batch table with useful data to replicate what I'm going to show, you can try something like this:

find /home -printf '%h\t%f\n' | psql -c "COPY batch FROM STDIN" bacula

We analyze the batch table: in the real code it is a temp table, so an explicit ANALYZE is compulsory (autovacuum cannot analyze temporary tables):

analyze batch;

Then we insert the missing paths. This is also one of the queries whose plan can go wrong, but we'll focus on the second one: as we are starting from an empty path table here, this query wouldn't be realistic anyway.

insert into path (path) select path from batch where not exists (select 1 from path where path.path=batch.path) group by path;

We analyze:

analyze path;

So now we insert into the file table.

insert into file (pathid,filename) select pathid, filename from batch join path using (path);

Here is the plan:

bacula=# explain select pathid, filename from batch join path using (path);
                               QUERY PLAN
-----------------------------------------------------------------------
 Hash Join  (cost=822.25..22129.85 rows=479020 width=26)
   Hash Cond: (batch.path = path.path)
   ->  Seq Scan on batch  (cost=0.00..11727.20 rows=479020 width=85)
   ->  Hash  (cost=539.89..539.89 rows=22589 width=81)
         ->  Seq Scan on path  (cost=0.00..539.89 rows=22589 width=81)
(5 rows)

I get this plan almost all the time. Let's add a bunch of useless records in path to be more realistic and make things worse: usually, bacula inserts data in batches of around 500,000 rows, and path is very large (millions of records) and bigger than work_mem (it would have to be 20GB in the most extreme cases), so the hash join may thrash disks heavily (many servers can be inserting at the same time).

Let's simulate this (remove indexes before, put them back afterwards :) )

insert into path (path) select generate_series(1000000,50000000); -- create a realistically sized path table
analyze path;
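
In case it helps anyone reproducing this, the index handling around the two statements above would look something like the following sketch. path_pkey is just the default constraint name PostgreSQL generates for the primary key, so it's an assumption on my part:

-- before the bulk insert
drop index idx_path;
alter table path drop constraint path_pkey;  -- default PK constraint name (assumption)

-- after the bulk insert and analyze
alter table path add primary key (pathid);
create unique index idx_path on path(path);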

Here is the plan:

bacula=# explain (analyze,buffers,verbose) select pathid, filename from batch join path using (path);
                                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1646119.66..1945643.59 rows=479020 width=26) (actual time=27275.240..36745.904 rows=479020 loops=1)
   Output: path.pathid, batch.filename
   Hash Cond: (batch.path = path.path)
   Buffers: shared hit=130760 read=179917 written=1823, temp read=224130 written=223876
   ->  Seq Scan on public.batch  (cost=0.00..11727.20 rows=479020 width=85) (actual time=0.259..176.031 rows=479020 loops=1)
         Output: batch.filename, batch.path
         Buffers: shared read=6937 written=1823
   ->  Hash  (cost=793966.96..793966.96 rows=49022696 width=16) (actual time=27274.725..27274.725 rows=49022590 loops=1)
         Output: path.pathid, path.path
         Buckets: 131072  Batches: 128  Memory Usage: 18329kB
         Buffers: shared hit=130760 read=172980, temp written=218711
         ->  Seq Scan on public.path  (cost=0.00..793966.96 rows=49022696 width=16) (actual time=0.231..9650.452 rows=49022590 loops=1)
               Output: path.pathid, path.path
               Buffers: shared hit=130760 read=172980
 Total runtime: 36781.919 ms
(15 rows)

It seems like a logical choice (and it works quite well here, but we only have one of these queries running, and my path table is still very small compared to real use cases).

Let's force a nested loop (we don't usually do it this way; in production we lower the *_page_cost settings instead):

set enable_hashjoin TO off; set enable_mergejoin TO off;

bacula=# explain (analyze,buffers,verbose) select pathid, filename from batch join path using (path);
                                                             QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.56..4001768.10 rows=479020 width=26) (actual time=2.303..15371.237 rows=479020 loops=1)
   Output: path.pathid, batch.filename
   Buffers: shared hit=2403958 read=7539
   ->  Seq Scan on public.batch  (cost=0.00..11727.20 rows=479020 width=85) (actual time=0.340..160.142 rows=479020 loops=1)
         Output: batch.path, batch.filename
         Buffers: shared read=6937
   ->  Index Scan using idx_path on public.path  (cost=0.56..8.32 rows=1 width=16) (actual time=0.030..0.031 rows=1 loops=479020)
         Output: path.pathid, path.path
         Index Cond: (path.path = batch.path)
         Buffers: shared hit=2403958 read=602
 Total runtime: 15439.043 ms

As you can see, more than twice as fast, with a very high hit ratio on the path table, even when starting from a cold cache (which I did here, for both PostgreSQL and the OS). We get an excellent hit ratio because the batch table contains few distinct paths (several files per directory) and is already quite clustered, as it comes from a backup, which is of course performed directory by directory.

What I think is the cause of the problem is that the planner doesn't take into account that we are going to fetch the exact same values over and over in the path table, so we'll have a very good hit ratio. Maybe the n_distinct of batch.path could be used to refine the estimated caching effect of the index scan? The correlation is almost 0, but that's expected: the directories aren't sorted alphabetically, yet all records for a given path are grouped together.
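
For reference, the statistics I'm referring to (n_distinct and correlation for batch.path) can be checked directly in pg_stats:

select attname, n_distinct, correlation
from pg_stats
where tablename = 'batch' and attname = 'path';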

For now, we work around this by using very low values for seq_page_cost and random_page_cost for these 2 queries. I just feel that maybe PostgreSQL could do a bit better here, so I wanted to submit this use case for discussion.
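
To be concrete, the workaround looks roughly like this; the values are only illustrative (we tune them per installation), and SET LOCAL is just one way of limiting the change to these two queries:

begin;
set local seq_page_cost = 0.1;
set local random_page_cost = 0.1;
insert into file (pathid, filename)
    select pathid, filename from batch join path using (path);
commit;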

Regards

Marc
