Re: How to get good performance for very large lists/sets?

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Richard Frith-Macdonald <richard(dot)frith-macdonald(at)brainstorm(dot)co(dot)uk>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to get good performance for very large lists/sets?
Date: 2014-10-06 19:17:36
Message-ID: CAMkU=1zxm22e-CVP0BJVMCoNX1k=9gkstKN6WbKCjX8QhrPe9Q@mail.gmail.com
Lists: pgsql-general

On Mon, Oct 6, 2014 at 1:02 AM, Richard Frith-Macdonald <
richard(dot)frith-macdonald(at)brainstorm(dot)co(dot)uk> wrote:

> I'm wondering if anyone can help with advice on how to manage large
> lists/sets of items in a postgresql database.
>
> I have a database which uses multiple lists of items roughly like this:
>
> CREATE TABLE List (
> ID SERIAL,
> Name VARCHAR ....
> );
>
> and a table containing individual entries in the lists:
>
> CREATE TABLE ListEntry (
> ListID INT, /* Reference the List table */
> ItemID INT /* References an Item table */
> ) ;
> CREATE UNIQUE INDEX ListEntryIDX ON ListEntry(ListID, ItemID);
>
> Now, there are thousands of lists, many with millions of entries, and
> items are added to and removed from lists in an unpredictable way (in
> response to our customer's actions, not under our control). Lists are also
> created by customer actions.
>

How is concurrency handled? If one person adds something to a list at the
same time as someone else removes something from the same list, is there
any interaction or do they just pass through each other cleanly? Are lists
ever rewritten in their entirety, or just incrementally have items added
and removed one at a time?

What is the distribution of the number of lists any given entry is in?

>
> Finding whether a particular item is in a particular list is reasonably
> fast, but when we need to do things like find all the items in list A but
> not list B things can get very slow (particularly when both lists contain
> millions of common items).
>

One possibility would be to have an Entry table where each row carries an array
of the lists that contain it, with a GIN index on that array.

Then "in A but not B" could use the GIN index to visit each item in A, and
could immediately see whether it is also in B without doing additional index
lookups. However, maintaining the GIN index could easily become a substantial
bottleneck.
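
A minimal sketch of that idea (all table, column, and index names here are
illustrative, not from the original schema):

```sql
-- Hypothetical Entry table: one row per item, with an int[] of the
-- lists that currently contain it.
CREATE TABLE Entry (
    ItemID  INT PRIMARY KEY,
    ListIDs INT[] NOT NULL DEFAULT '{}'
);

-- GIN index so array-containment operators (@>, &&) can use it.
CREATE INDEX Entry_ListIDs_gin ON Entry USING gin (ListIDs);

-- "Items in list 1 but not in list 2": the @> probe can use the GIN
-- index to find rows in list 1, and the NOT ... @> condition is then
-- answered from the same array with no further index lookups.
SELECT ItemID
FROM   Entry
WHERE  ListIDs @> ARRAY[1]
  AND  NOT (ListIDs @> ARRAY[2]);
```

The maintenance cost shows up on writes: adding or removing an item from a
list means rewriting that row's whole array (e.g.
`UPDATE Entry SET ListIDs = array_append(ListIDs, 3) WHERE ItemID = 42;`)
plus the corresponding GIN index update.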

>
> I think the server won't use index-only scans because, even in cases
> where a particular list has not had any recent changes, the ListEntry table
> will almost always have had some change (for one of the other lists) since
> its last vacuum.
> Perhaps creating multiple ListEntry tables (one for each list) would allow
> better performance; but that would be thousands (possibly tens of
> thousands) of tables, and allowing new tables to be created by our clients
> might conflict with things like nightly backups.
>

Reasoning about this situation would be a lot easier if you gave an example
query, its EXPLAIN (ANALYZE, BUFFERS) output, and a description of the plan
you want it to use instead. It sounds like you might want a single
index-only scan to supply both branches of a hash semi-join simultaneously,
which I don't think is implemented even when the visibility map is in good
shape.
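
For reference, the kind of output that helps here comes from something like
the following (the query shown is only a guess at the anti-join being
discussed):

```sql
-- Run against the ListEntry table from the original message; "list A"
-- and "list B" are stand-ins for real list IDs.
EXPLAIN (ANALYZE, BUFFERS)
SELECT a.ItemID
FROM   ListEntry a
WHERE  a.ListID = 1            -- list A
  AND  NOT EXISTS (
         SELECT 1
         FROM   ListEntry b
         WHERE  b.ListID = 2   -- list B
           AND  b.ItemID = a.ItemID
       );
```

The BUFFERS option is what reveals how much heap I/O each node is doing,
which is the interesting part when diagnosing why an index-only scan isn't
paying off.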

I don't think PostgreSQL will ever demote an index-only scan to a plain
index scan just because it thinks the index-only part won't be useful
enough. But it might switch to an entirely different plan that is not
eligible to use an index-only scan in the first place, because the reduced
usefulness of the index-only scan makes the other plan look cheaper.

So what you could do is vacuum a quiescent test database (do it several
times for good measure, since on some versions the first pass isn't enough
to get the visibility map fully set), and then run the query. If you then
get an index-only scan, that means you might need to vacuum your production
database more aggressively.
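
In a test database, that check might look like this (assuming the ListEntry
table from the original message; the final query is just a simple probe):

```sql
-- Vacuum more than once so the visibility map ends up fully set.
VACUUM (ANALYZE) ListEntry;
VACUUM ListEntry;

-- See how much of the table the planner considers all-visible;
-- relallvisible close to relpages is what index-only scans need.
SELECT relpages, relallvisible
FROM   pg_class
WHERE  relname = 'listentry';

-- Re-run a representative query and look for "Index Only Scan" with a
-- low "Heap Fetches" count in the plan.
EXPLAIN (ANALYZE)
SELECT ItemID FROM ListEntry WHERE ListID = 1;
```

If the test database shows an index-only scan but production doesn't, the
gap is usually visibility-map coverage rather than the index itself.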

Cheers,

Jeff
