From: | Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> |
---|---|
To: | Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ivan Panchenko <ivan(at)xray(dot)sai(dot)msu(dot)ru>, Teodor Sigaev <teodor(at)stack(dot)net> |
Subject: | Bidirectional hard joins (fwd) |
Date: | 2002-04-04 12:17:41 |
Message-ID: | Pine.GSO.4.44.0204041507140.3910-100000@ra.sai.msu.su |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Tom,
I attached a message from my colleague and think it'd be interesting
to you. A short history: During developing of one project on
Windows platform, Teodor has discovered a pretty nice feature of Gigabase
(free embedded database by Konstantin Knizhnik,
http://www.geocities.com/kknizhnik/gigabase.html) which helps us a lot.
Ivan has wrote a proposal for implementing it in PostgreSQL.
Could you, please, comment the proposal.
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---------- Forwarded message ----------
Date: Wed, 3 Apr 2002 01:40:05 +0400 (MSD)
From: Ivan E. Panchenko <ivan(at)xray(dot)sai(dot)msu(dot)ru>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)stack(dot)net>
Subject: Bidirectional hard joins
Hi,
Here we propose some essential improvement of postgreSQL functionality,
which may provide a great perfomance increase.
1. Problem
The fastest way to find and fetch a record from a table is to
perform a SELECT ... WHERE record.id = value.
Probably, an index scan would be performed for this SELECT.
Such index scan seems to be fast, but there are some cases where it may
appear too slow. The most evident case is the case of a sub-query, which
can arise as a result of a join or a nested select statement.
If it were possible to store direct references to database
records in the tables, joins could be implemented in much more effective
way.
2. Possible soultion
Creating a datatype which stores direct reference
to the record (i.e., physical location of the tuple) is only a part of the
solution.
When a record that is referenced is updated its physical location can be
changed, so the references to it should be updated. To make this
possible, the referenced record should remember all the references to
itself. Thus, we consider the direct tuple references as bidirectional
links, or "bidirectional hard joins".
These "hard joins" are in some sense similar to hard links in a
filesystem (in this analogy, classic joins are like symbolic links).
Philosophically, this means a convergence between indexes and tables: a
table behaives like an index for an other table.
Obviously, this is is a nonrelational feature, and it requires some
special SQL syntax. Below we provide some examples for clarification of
the use of the proposed feature.
3. Examples
CREATE JOIN paternity FROM man.children TO child.father ;
-- creates a field man.children containing a reference to the table
child, and a field father in the table child with a back reference.
INSERT INTO man VALUES ('Bob Scott');
INSERT INTO child VALUES ('Charles Scott');
LINK paternity WHERE (man.name = 'Bob Scott'),(child.name = 'Charles Scott');
-- Create a link betewen the two records.
INSERT INTO child VALUES ('Doug Scott');
LINK paternity (man.name = 'Bob Scott'),(child.name = 'Doug Scott');
SELECT child.name from child, man
WHERE paternity(man,child) AND man.name = 'Bob Scott';
-- Find all Bob's children
> Charles Scott
> Doug Scott
> 2 records seleted.
---------------------------------------------------------------
This syntax was thought of just for illustration and is not proposed to
implement (now?).
4. Performance
When direct joins are used in select statements, they can strongly
increase performance.
Let us examine the query plan of the request ("Find all Bob's
children") from the example above in the present day postgres.
create table man (id SERIAL,name text);
create table child (id SERIAL,name text, parent_id int4 references man(id));
.. populate the tables ... and create indexes...
explain select child.name from child, man
where child.parent_id = man.id
and man.name = 'Bob Scott';
Nested Loop
-> Index Scan using man_n on man
-> Index Scan using child_par on child
In a hypotetical postgres with hard joins it could be:
Nested Loop
-> Index Scan using man_n on man
-> Direct retrieval on child
I.e., the for each retrieved "man" record we retrieve the "child" records
directly using hard join. The real overhead for this operation should be
neglible in comparison with index scan.
Using the hard joins require some additional overhead in updates. In fact,
after updating the record which takes part in such join, the references
to this record in the other records should be also updated. This operation
is not essentially new for postgres as similar things are done with
indexes when an indexed record is updated. Hence, the overhead for updates
is not greater than the overhead for updating indexes.
5. Implementation and conclusion
Effective implementing of hard joins requires hard changes to postgres,
most serious of them probably in the executor, where a new method "fetch
record by reference" should be added in addition to "index scan" and "seq
scan". Also the optimizer should be taught to deal with this.
The update support is not so hard as it is similar to the updating of
indexes.
Though the implementation of such hard joins is really a complicated task,
the performance it brings should be tremendous, so we consider discussing
this important.
From | Date | Subject | |
---|---|---|---|
Next Message | Bruce Momjian | 2002-04-04 14:23:39 | Re: timeout implementation issues |
Previous Message | Vince Vielhaber | 2002-04-04 10:25:10 | Re: Contrib update |