From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-students(at)postgresql(dot)org
Subject: Slicing TOAST
Date: 2013-05-14 07:05:03
Message-ID: CA+U5nMJGgJNt5VXqkR=crtDqXFmuyzwEF23-fD5NuSns+6N5dA@mail.gmail.com
Lists: pgsql-hackers pgsql-students
I'm proposing this now as a possible GSoC project:
With single-byte character encodings (i.e. not UTF-8), SUBSTR() is
optimised to seek straight to the exact slice when retrieving a large
toasted value. This reduces I/O considerably for large toasted values,
since the lookup is an O(1) action rather than an O(N) one.
This is possible because the slicing of toasted values is predictable
with single-byte encodings.
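For example (the table and column names here are invented for
illustration), in a single-byte-encoded database a query like

SELECT substr(body, 1000000, 100) FROM docs WHERE id = 1;

need only fetch the few toast chunks covering those 100 bytes, rather
than detoasting the whole value.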
It would be useful to have a predictable function perform the slicing,
so we could use that knowledge later to optimise searches in a wider
range of situations. More specifically, since UTF-8 is so common, this
would allow optimisations in that encoding for common kinds of data:
text, XML, JSON.
e.g. if we knew that an XML document has a required element called
TITLE that occurs only once and always in the first slice, that would
be useful information for search functions. (Not sure, but it may be
possible to assign non-consecutive slice numbers to allow
variable-length data mid-way through a column value if needed.)
e.g. in UTF-8 free text we could put 500 characters in each slice;
since a UTF-8 character occupies between 1 and 4 bytes, a slice could
be anywhere between 500 and 2000 bytes and would still fit just fine.
e.g. for arrays, if we put say 200 elements per slice, then accessing
a particular element would require only one slice retrieval.
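To illustrate the arithmetic (pure illustration, none of this exists):
with 200 elements per slice, element i would live in slice
(i - 1) / 200 by integer division, e.g.

SELECT (12345 - 1) / 200;  -- 61: the single slice holding element 12345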
Doing this would *possibly* reduce packing density, but not certainly
so. But it would greatly improve access times to large structured
toast values.
The implementation would be a slicing function that gets called
iteratively on a column value until it returns no further slices.
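As a sketch of that contract (SQL and plpgsql purely for illustration;
a real splitter would presumably be a C-level function, and
split_value below is an invented name), reusing the my_toast_function
name from the syntax example further down:

-- hypothetical splitter: returns the length of the next slice of the
-- remaining value, or NULL once nothing is left; this one cuts every
-- 500 characters, matching the UTF-8 example above
CREATE FUNCTION my_toast_function(remainder text) RETURNS integer AS $$
  SELECT CASE WHEN length($1) = 0 THEN NULL
              ELSE least(length($1), 500) END
$$ LANGUAGE sql IMMUTABLE;

-- how the caller would drive it: peel off one slice per call until
-- the splitter reports no further slices
CREATE FUNCTION split_value(val text) RETURNS SETOF text AS $$
DECLARE n integer;
BEGIN
  LOOP
    n := my_toast_function(val);
    EXIT WHEN n IS NULL;
    RETURN NEXT substr(val, 1, n);
    val := substr(val, n + 1);
  END LOOP;
END;
$$ LANGUAGE plpgsql IMMUTABLE;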
There is no proposal for the search functions themselves. It would be
up to each search function to confirm the details of the slicing
prior to relying on that knowledge in a search. We'd need a way to
check that a function's inputs matched the slicing of the column,
which means some requirement on the function's input parameters that
can be matched against metadata on the column. That presumably means
some decoration of the input parameters of functions, which sounds
like rather too much difficulty.
So the proposal would be to provide the slicing/chunking function at
the datatype level, not the column level. The user would create a
binary-compatible type, effectively XML or whatever, just with extra
constraints on usage to enable slicing.
Now that I consider the syntax, I'll call it a splitter function,
since "slicer" and "chunker" sound silly to me.
CREATE TYPE my_xml
LIKE xml
SPLITTER my_toast_function;
Search functions would then be designed to take such datatypes as
input, and could rely with certainty upon the toast slicing algorithm
when retrieving data.
Doing it this way means that different XML or JSON schemas could have
specific search functions optimised for them.
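Usage might then look something like this (a sketch only: neither
my_xml nor title() exists, and title() stands for whatever
schema-specific search function gets written):

CREATE TABLE articles (id integer, doc my_xml);

SELECT title(doc) FROM articles WHERE id = 1;
-- title() could fetch just the first slice, which the splitter
-- guarantees contains the TITLE element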
I'm proposing this now as a possible GSoC project; I don't propose to
actively work on it myself.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services