Re: query not scaling

From: Rob Sargent <robjsargent(at)gmail(dot)com>
To: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: query not scaling
Date: 2017-10-31 16:16:14
Message-ID: b605e78f-477e-879b-5a49-182d74503f2c@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 10/31/2017 03:12 AM, Laurenz Albe wrote:
> Rob Sargent wrote:
>>> I think your biggest problem is the join condition
>>> on m.basepos between s.startbase and s.endbase
>>>
>>> That forces a nested loop join, which cannot be performed efficiently.
>> Agree! 800,000 * 4,000 = 3,200,000,000. It's just that I thought I had
>> corralled that problem which indexing but apparently not. I was hoping
>> some kind soul might point out a missing index or similar correction. I
>> have completely reworked the process, but not sure yet if it's correct.
>> (The slow answer is correct, once it comes in.)
> You can create indexes that are useful for this query:
>
> ON sui.segment(chrom, markerset_id)
> ON sui.probandset(people_id)
>
> But that probably won't make a big difference, because the sequential
> scans take only a small fraction of your query time.
>
> A little less than half of the query time is spent in the nested loop
> join, and a little more than half of the time is spent doing the
> GROUP BY.
>
> Perhaps the biggest improvement you can easily make would be to
> get rid of "numeric" for the computation. I suspect that this is
> where a lot of time is spent, since the hash aggregate is over
> less than 15000 rows.
>
> Unless you really need the precision of "numeric", try
>
> CREATE OR REPLACE FUNCTION pv(l bigint, e bigint, g bigint, o int)
> RETURNS double precision LANGUAGE sql AS
> $$SELECT (g+e+o)::double precision / (l+e+g+o)::double precision$$;
>
> Yours,
> Laurenz Albe
In practice markersets are always aligned with one chromosome so I would
not expect this to have an effect. There's no constraint on this
however, and there can be more than one markerset per chromosome. I
have played with indexing on segment.markerset_id.

In all the data sets used in the examples (runtimes, explains etc) there
has been a in single people_id across the existing segment data. Down
the road this of course will not be the case and I can see the value of
an index on probandset.people_id eventually. I can certainly add it now
for a test. I'm currently writing a probandset loader hoping to get a
test case for the problem with gin indexing mentioned up-thread.

I think I'm most surprise at the notion that the arithmetic is the
problem and will happily test your suggestion to force floating point
values. The value can get small (10^-12 on a good day!) but we don't
need many digits of precision.

Thanks

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Stephen Froehlich 2017-10-31 20:36:37 PG 10 - Trouble with overlap of range partition of two dimensions
Previous Message Stephen Frost 2017-10-31 12:12:12 Re: Roles inherited from a role which is the owner of a database can drop it?