From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-performance(at)postgresql(dot)org
Subject: Re: reducing random_page_cost from 4 to 2 to force index scan
Date: 2011-05-19 22:27:35
Message-ID: BANLkTikiQmj1z_VtdcGxw6_v-tdnPuwSug@mail.gmail.com
Lists: pgsql-performance
2011/5/19 Jim Nasby <jim(at)nasby(dot)net>:
> On May 19, 2011, at 9:53 AM, Robert Haas wrote:
>> On Wed, May 18, 2011 at 11:00 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
>>> Jim Nasby wrote:
>>>> I think the challenge there would be how to define the scope of the
>>>> hot-spot. Is it the last X pages? Last X serial values? Something like
>>>> correlation?
>>>>
>>>> Hmm... it would be interesting if we had average relation access times for
>>>> each stats bucket on a per-column basis; that would give the planner a
>>>> better idea of how much IO overhead there would be for a given WHERE clause
>>>
>>> You've already given one reasonable first answer to your question here. If
>>> you defined a usage counter for each histogram bucket, and incremented that
>>> each time something from it was touched, that could lead to a very rough way
>>> to determine access distribution. Compute a ratio of the counts in those
>>> buckets, then have an estimate of the total cached percentage; multiplying
>>> the two will give you an idea how much of that specific bucket might be in
>>> memory. It's not perfect, and you need to incorporate some sort of aging
>>> method to it (probably weighted average based), but the basic idea could
>>> work.
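
[As a rough illustration of the heuristic quoted above -- nothing here is real PostgreSQL code; the aging factor, the per-bucket counters, and the cached-percentage input are all assumptions -- a minimal sketch might look like:]

```python
# Toy sketch of the heuristic quoted above: per-histogram-bucket usage
# counters, aged with a weighted average, combined with an estimate of
# how much of the relation is cached. All names and numbers are made up.

ALPHA = 0.2  # aging weight; assumed value

class BucketUsage:
    def __init__(self, n_buckets):
        self.counts = [0.0] * n_buckets  # aged usage counter per bucket

    def touch(self, bucket_id):
        # Bump the counter for the bucket a fetched value fell into.
        self.counts[bucket_id] += 1.0

    def age(self):
        # Decay old counts periodically so recent activity dominates.
        self.counts = [c * (1.0 - ALPHA) for c in self.counts]

    def cached_estimate(self, bucket_id, relation_cached_pct):
        # Ratio of this bucket's usage to the total, multiplied by the
        # overall cached percentage -- the rough product described above.
        total = sum(self.counts)
        if total == 0.0:
            return relation_cached_pct
        return (self.counts[bucket_id] / total) * relation_cached_pct
```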
>>
>> Maybe I'm missing something here, but it seems like that would be
>> nightmarishly slow. Every time you read a tuple, you'd have to look
>> at every column of the tuple and determine which histogram bucket it
>> was in (or, presumably, which MCV it is, since those aren't included
>> in working out the histogram buckets). That seems like it would slow
>> down a sequential scan by at least 10x.
>
> You definitely couldn't do it real-time. But you might be able to copy the tuple somewhere and have a background process do the analysis.
>
> That said, it might be more productive to know what blocks are available in memory and use correlation to guesstimate whether a particular query will need hot or cold blocks. Or perhaps we create a different structure that lets you track the distribution of each column linearly through the table; something more sophisticated than just using correlation.... perhaps something like indicating which stats bucket was most prevalent in each block/range of blocks in a table. That information would allow you to estimate exactly what blocks in the table you're likely to need...
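
[To make that last idea concrete, here is a toy sketch; the block-range granularity, the sampling input, and every name below are invented for illustration:]

```python
# Toy sketch of the per-block-range idea: remember which histogram bucket
# is most prevalent in each range of blocks, then estimate which ranges a
# WHERE clause that targets certain buckets would have to read.

from collections import Counter

BLOCKS_PER_RANGE = 128  # assumed granularity

def build_block_map(samples):
    """samples: iterable of (block_number, bucket_id) pairs, e.g. from sampling."""
    per_range = {}
    for block, bucket in samples:
        rng = block // BLOCKS_PER_RANGE
        per_range.setdefault(rng, Counter())[bucket] += 1
    # Keep only the most prevalent bucket per block range.
    return {rng: c.most_common(1)[0][0] for rng, c in per_range.items()}

def ranges_needed(block_map, wanted_buckets):
    """Which block ranges would a predicate hitting 'wanted_buckets' touch?"""
    return sorted(rng for rng, b in block_map.items() if b in wanted_buckets)
```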
Those are good ideas that I would keep in mind for vacuum/checkpoint
tasks: if you are able to tell hot data from cold data, then you can
order it within the segments of the relation. But making it work at the
planner level looks hard. I am not opposed to the idea, but I have no
idea how to do it right now.
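
[Purely as a thought experiment of what such a vacuum-time reordering could start from -- the hotness scores and everything else here are hypothetical:]

```python
# Hypothetical sketch: given a hotness score per block, pick an ordering
# that clusters hot blocks at the front of the relation. A real version
# would have to move tuples, update indexes, and handle free space.

def plan_reordering(hotness_by_block):
    """hotness_by_block: dict mapping block number -> hotness score."""
    return sorted(hotness_by_block, key=hotness_by_block.get, reverse=True)

# Example: blocks 2 and 7 are hot, the rest cold.
print(plan_reordering({0: 0.1, 2: 0.9, 5: 0.2, 7: 0.8}))  # -> [2, 7, 5, 0]
```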
> --
> Jim C. Nasby, Database Architect jim(at)nasby(dot)net
> 512.569.9461 (cell) http://jim.nasby.net
>
--
Cédric Villemain 2ndQuadrant
http://2ndQuadrant.fr/ PostgreSQL: Expertise, Training and Support