Re: Raw devices vs. Filesystems

From: "Murthy Kambhampaty" <murthy(dot)kambhampaty(at)goeci(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Chris Browne" <cbbrowne(at)acm(dot)org>
Cc: <pgsql-admin(at)postgresql(dot)org>
Subject: Re: Raw devices vs. Filesystems
Date: 2004-04-07 16:46:16
Message-ID: 3554BFA5FAE4B14FB5459D36E4F4E36203D956@io.goeci.com
Lists: pgsql-admin

On Wednesday, April 07, 2004 1:26 AM Tom Lane wrote:

>
> But to get back to the point of this discussion: to allow PG
> to use raw devices instead of filesystems, we'd first have to do a ton of
> portability work
...

[The following is said in a low, tentative voice :) ]

I wonder whether writing the postgresql data structures as HDF5 data structures (http://hdf.ncsa.uiuc.edu/whatishdf5.html) within a single HDF5 file (perhaps the WAL files would still reside elsewhere) would improve performance, while letting HDF5 handle portability and provide other useful features. That might be a better solution than relying on filesystem features.

HDF5 actually provides an added portability advantage that postgresql does not currently enjoy:
"a completely portable file format, so that a file can be written on any system and read on any other"
(See http://hdf.ncsa.uiuc.edu/HDF5/RD100-2002/All_About_HDF5.pdf)
The HDF5 "distribution" includes tools for dumping data structures, etc. so if you're hooked on filesystem level operations, you have the ability to inspect postgresql data structures within the HDF5 file, i.e., "outside postgresql".
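
To make the "written on any system and read on any other" point concrete, here is a minimal sketch of what a byte-order-independent record layout looks like. It uses only Python's standard struct module, not HDF5, and the record layout (a 32-bit id plus a 64-bit float) and the helper names are made up for illustration. HDF5's own mechanism is considerably richer than this.

```python
import struct

# Hypothetical record layout, for illustration only: int32 id + float64 value.
# The "<" prefix selects little-endian byte order with standard sizes and no
# padding, so the encoded bytes do not depend on the machine that wrote them.
PORTABLE = struct.Struct("<id")


def write_record(rec_id, value):
    """Encode one record into a machine-independent byte string."""
    return PORTABLE.pack(rec_id, value)


def read_record(buf):
    """Decode a record written by write_record on any platform."""
    return PORTABLE.unpack(buf)


data = write_record(42, 3.5)
print(len(data), read_record(data))
```

A format that instead used the native layout (struct's default "@" mode) could differ in byte order and padding between machines, which is roughly the situation with on-disk files that are tied to the architecture that wrote them.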

HDF5 is also designed for clustered/grid computing systems:
"The HDF5 format and library provide a powerful means of organizing and accessing data in a manner that allows scientists to share, process, and manipulate data in today's heterogeneous and quickly-evolving high-performance computational environment, including the emerging computational GRIDs." (http://hdf.ncsa.uiuc.edu/HDF5/RD100-2002/All_About_HDF5.pdf, p. 3).
So, the main purpose of this post is to suggest that HDF5's design moves a postgresql version built on an HDF5 datastore that much closer to being ready for cluster-computing environments, at least with respect to the datastore (there's still the shared memory, etc., that would need to be addressed, but ...).

We're playing with HDF5 from Python (see the pytables project) for our "analytics" work, but that requires moving data out of postgresql. I suspect that an SQL interface to HDF5 data structures using postgresql would be a lot more convenient, and that postgresql would gain multiple benefits from having all its data structures in a single HDF5 file. OTOH, maybe us analytics types are better off with Python over HDF5 and "postgresql on HDF5" is not a net win for postgresql. Still, there seems to be a great advantage to having rich data structures to operate on rather than just "files", and to allowing the HDF5 library to deal with portability, I/O efficiency, and clustering.

Hope my $0.02 worth was.

Cheers,
Murthy
