From: | Faheem Mitha <faheem(at)email(dot)unc(dot)edu> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | pgsql-performance(at)postgresql(dot)org |
Subject: | Re: experiments in query optimization |
Date: | 2010-03-29 20:22:05 |
Message-ID: | alpine.DEB.2.00.1003300141290.13883@orwell.homelinux.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
On Mon, 29 Mar 2010, Robert Haas wrote:
> On Mon, Mar 29, 2010 at 2:31 PM, Faheem Mitha <faheem(at)email(dot)unc(dot)edu> wrote:
>>> It's not really too clear to me from reading this what specific
>>> questions you're trying to answer.
>>
>> Quote from opt.{tex/pdf}, Section 1:
>>
>> "If I have to I can use Section~\ref{ped_hybrid} and
>> Section~\ref{tped_hybrid}, but I am left wondering why I get the performance
>> I do out of the earlier versions. Specifically, why is
>> Section~\ref{ped_bigjoin} so much slower than Section~\ref{ped_trunc}, and
>> why does the memory usage in Section~\ref{ped_phenoout} blow up relative to
>> Section~\ref{ped_bigjoin} and Section~\ref{ped_trunc}?"
>
> Here and in the document, you refer to section numbers for the
> "hybrid" version but I don't see where you define what the "hybrid"
> version actually is.
It is defined later in the file. I don't know if you are looking at the
pdf, but if so, it is Section 2.4 (for the hybrid PED query). In the text
file, I guess the easist way would be to grep for the label ped_hybrid.
> And the differences between your queries are not real clear either -
> first you say you took out pheno and sex because they weren't necessary,
> but then you decide to put them back. I don't know what that means.
> If they're not necessary, leave them out.
I don't see where I say that pheno and sex weren't necessary. In fact, the
word 'necessary' does not appear in the opt document. I took them out to
see how it would affect performance. Which is does, dramatically. I say
"So, I decided to remove the joins to tables corresponding to the patient
data, namely pheno and sex, and the runtime dropped to 150 min, while the
memory stayed around 5G."
Maybe I wasn't being sufficiently explicit here. Perhaps
"So, I decided to remove the joins to tables corresponding to the patient
data, namely pheno and sex, to see how it would affect performance..."
would have been better.
>>> One random thought: WHERE row_number() = 1 is not too efficient.
>>> Try using LIMIT or DISTINCT ON instead.
>>
>> Possibly. However, the CTE that uses
>>
>> WHERE row_number() = 1
>>
>> doesn't dominate the runtime or memory usage, so I'm not too concerned
>> about it.
>
> Hmm, you might be right.
>
>>> If you're concerned about memory usage, try reducing work_mem; you've
>>> probably got it set to something huge.
>>
>> work_mem = 1 GB (see diag.{tex/pdf}).
>>
>> The point isn't that I'm using so much memory. Again, my question is, why
>> are these changes affecting memory usage so drastically?
>
> Well each sort or hash can use an amount of memory that is limited
> from above by work_mem. So if you write the query in a way that
> involves more sorts or hashes, each one can add up to 1GB to your
> memory usage, plus overhead. However, it doesn't look like any of
> your queries including 30 sorts or hashes, so I'm thinking that the
> RSS number probably also includes some of the shared memory that has
> been mapped into each backend's address space. RSS is not a terribly
> reliable number when dealing with shared memory; it's hard to say what
> that really means.
>>> You might need to create some indices, too.
>> Ok. To what purpose? This query picks up everything from the tables and the
>> planner does table scans, so conventional wisdom and indeed my experience,
>> says that indexes are not going to be so useful.
> Well, a hash join is not usually the first thing that pops to mind when
> dealing with a table that has 825 million rows (geno). I don't know if
> a nested loop with inner-indexscan would be faster, but it would almost
> certainly use less memory.
Can you provide an illustration of what you mean? I don't know what a
"nested loop with inner-indexscan" is in this context.
Regards, Faheem.
From | Date | Subject | |
---|---|---|---|
Next Message | Josh Berkus | 2010-03-30 06:18:27 | Re: why does swap not recover? |
Previous Message | Robert Haas | 2010-03-29 19:55:46 | Re: experiments in query optimization |