From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
---|---|
To: | Heikki Linnakangas <heikki(at)enterprisedb(dot)com> |
Cc: | Simon Riggs <simon(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Sequential scans |
Date: | 2007-05-02 17:52:18 |
Message-ID: | 1178128338.28383.154.camel@dogma.v10.wvs |
Lists: | pgsql-hackers |
On Wed, 2007-05-02 at 14:26 +0100, Heikki Linnakangas wrote:
> Hi,
>
> I'm starting to review the "synchronized scans" and "scan-resistant
> buffer cache" patches. The patches have complex interactions so I'm
> taking a holistic approach.
>
> There are four outstanding issues with the sync scans in particular:
>
> 1. The simplistic hash approach. While it's nice to not have a lock, I'm
> worried about collisions. If you had a collision every now and then, it
> wouldn't be that bad, but because the hash value is computed from the
> oid, a collision would be persistent. If you create a database and
> happen to have two frequently seqscanned tables that collide, the only
> way to get rid of the collision is to drop and recreate a table.
> Granted, that'd probably be very rare in practice, but when it happens
> it would be next to impossible to figure out what's going on.
>
> Let's use a normal hash table instead, and use a lock to protect it. If
> we only update it every 10 pages or so, the overhead should be
> negligible. To further reduce contention, we could modify ReadBuffer to
> let the caller know if the read resulted in a physical read or not, and
> only update the entry when a page is physically read in. That way all
> the synchronized scanners wouldn't be updating the same value, just the
> one performing the I/O. And while we're at it, let's use the full
> relfilenode instead of just the table oid in the hash.
What should be the maximum size of this hash table? Is there existing
hash table code that I should use to be consistent with the rest of the
code?
I'm still trying to understand the effect of using the full relfilenode.
Do you mean using the entire relation _segment_ as the key? That doesn't
make sense to me. Or do you just mean using the relfilenode (without the
segment) as the key?
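To make the questions above concrete, here is a minimal sketch of the kind of locked hash table being proposed. It is not code from the patch. `RelKey`, `SCAN_TABLE_SIZE`, `scan_report` and `scan_get_start` are hypothetical names, the key assumes "full relfilenode" means the tablespace/database/relation triple without a segment number, and a pthread mutex stands in for whatever lock the backend would actually use.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define SCAN_TABLE_SIZE 64   /* hypothetical cap on tracked relations */
#define REPORT_INTERVAL 10   /* "every 10 pages or so", as suggested above */

/* Hypothetical stand-in for a relfilenode without the segment number. */
typedef struct { uint32_t spc, db, rel; } RelKey;

typedef struct {
    int      used;    /* slot occupied? */
    RelKey   key;     /* full key, so hash collisions are resolved exactly */
    uint32_t block;   /* last reported scan position */
} ScanEntry;

static ScanEntry scan_table[SCAN_TABLE_SIZE];
static pthread_mutex_t scan_lock = PTHREAD_MUTEX_INITIALIZER;

static int key_equal(RelKey a, RelKey b)
{
    return a.spc == b.spc && a.db == b.db && a.rel == b.rel;
}

static uint32_t key_hash(RelKey k)
{
    return (k.spc * 31u + k.db) * 31u + k.rel;
}

/* Linear probing: returns the matching slot, the first free slot,
 * or NULL if the table is full. Caller must hold scan_lock. */
static ScanEntry *find_slot(RelKey k)
{
    uint32_t start = key_hash(k) % SCAN_TABLE_SIZE;
    for (uint32_t i = 0; i < SCAN_TABLE_SIZE; i++) {
        ScanEntry *e = &scan_table[(start + i) % SCAN_TABLE_SIZE];
        if (!e->used || key_equal(e->key, k))
            return e;
    }
    return NULL;
}

/* Record the scan position, but only every REPORT_INTERVAL pages so the
 * lock is taken rarely. Returns 1 if the table was updated, else 0. */
static int scan_report(RelKey k, uint32_t block)
{
    int updated = 0;
    assert(REPORT_INTERVAL > 0);
    if (block % REPORT_INTERVAL != 0)
        return 0;
    pthread_mutex_lock(&scan_lock);
    ScanEntry *e = find_slot(k);
    if (e != NULL) {
        e->used = 1;
        e->key = k;
        e->block = block;
        updated = 1;
    }
    pthread_mutex_unlock(&scan_lock);
    return updated;
}

/* Where should a new scan of this relation start? 0 if nothing is known. */
static uint32_t scan_get_start(RelKey k)
{
    uint32_t block = 0;
    pthread_mutex_lock(&scan_lock);
    ScanEntry *e = find_slot(k);
    if (e != NULL && e->used)
        block = e->block;
    pthread_mutex_unlock(&scan_lock);
    return block;
}
```

Because each slot stores the whole key, two tables whose keys hash to the same bucket simply occupy different slots, so a collision can no longer cause two scans to overwrite each other's position. The sketch never evicts entries, so the maximum-size question remains open: once the table fills up, new relations are just not tracked.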
> 3. By having different backends doing the reads, are we destroying OS
> readahead as Tom suggested? I remember you performed some tests on that,
> and it was a problem on some systems but not on others. This needs some
> thought, there may be some simple way to address that.
Linux with the CFQ I/O scheduler performs very poorly and inconsistently
with concurrent sequential scans, regardless of whether the scans are
synchronized or not. I suspect the reason is that CFQ is designed to
care more about which process issued a request than about any other
factor.
Every other I/O system I tested either performed ideally (no
interference between scans) or showed some interference but still
performed much better than the current behavior.
Of course, my tests are limited and there are many possible combinations
of I/O systems that I did not try.
> 4. It fails regression tests. You get an assertion failure on the portal
> test. I believe that changing the direction of a scan isn't handled
> properly; it's probably pretty easy to fix.
>
I will examine the code more carefully. As a first guess, is it possible
that the test is failing because of the non-deterministic order in which
tuples are returned?
> Jeff, could you please fix 1 and 4? I'll give 2 and 3 some more thought,
> and take a closer look at the scan-resistant scans patch. Any comments
> and ideas are welcome, of course..
>
Yes. I'll also try to address the other issues in your email. Thanks for
your comments.
Regards,
Jeff Davis