Re: Disk buffering of resultsets

From: Vitalii Tymchyshyn <vit(at)tym(dot)im>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, John R Pierce <pierce(at)hogranch(dot)com>, PG-JDBC Mailing List <pgsql-jdbc(at)postgresql(dot)org>
Subject: Re: Disk buffering of resultsets
Date: 2014-10-13 15:33:46
Message-ID: CABWW-d2j6PapSxo29dsG2CP2Somsh8cUjcOX1eXdG=NH7NC0hQ@mail.gmail.com
Lists: pgsql-jdbc

Hello, again.

Sorry for the pause, I had a really busy week. Still, it gave me time to
think a little more.
As I see it, there are three goals that can be addressed independently:

1) Prevent OOMs
Unfortunately, this can only be addressed by saving results outside the
heap. The approach in my draft would still OOM when a secondary query
arrives.
Note that this is not that unusual. A secondary query is often used,
without any multithreading, to perform a client-side join, e.g. when a
complicated inheritance scenario is in place, or to load some dictionary
data without much duplication (e.g. only a few wide dictionary entries for
a long query), ...
I am still thinking of doing it without much parsing (you only need the
record type and size, with no field parsing) by simply copying the data
as-is to a temp file. Pluggable interfaces can be added later if needed.
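
To illustrate the "record type and size only" idea, here is a minimal sketch
in Java. It relies on the framing used by version 3 of the PostgreSQL wire
protocol: each backend message is a one-byte type followed by a four-byte
big-endian length that counts itself but not the type byte. The class
RawMessageSpooler and its methods are hypothetical names, not pgjdbc code,
and stopping at a caller-supplied message type (e.g. 'C', CommandComplete)
is an assumption about where a result set would end.

```java
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of pgjdbc): copies backend messages as-is,
// looking only at the type byte and the length word of each message.
final class RawMessageSpooler {

    // Copies whole messages from 'in' to 'out' until the stream ends or a
    // message of type 'stopType' has been copied. Returns the message count.
    static int spool(InputStream in, OutputStream out, int stopType) throws IOException {
        DataInputStream src = new DataInputStream(in);
        DataOutputStream dst = new DataOutputStream(out);
        byte[] buf = new byte[8192];
        int copied = 0;
        int type;
        while ((type = src.read()) >= 0) {
            int length = src.readInt(); // includes these 4 bytes, not the type byte
            if (length < 4) {
                throw new IOException("Invalid message length: " + length);
            }
            dst.writeByte(type);
            dst.writeInt(length);
            // Copy the payload without interpreting any fields.
            for (int left = length - 4; left > 0; ) {
                int n = src.read(buf, 0, Math.min(buf.length, left));
                if (n < 0) {
                    throw new EOFException("Truncated message body");
                }
                dst.write(buf, 0, n);
                left -= n;
            }
            copied++;
            if (type == stopType) {
                break;
            }
        }
        dst.flush();
        return copied;
    }

    // Spools into a fresh temp file; the caller is responsible for deleting it.
    static Path spoolToTempFile(InputStream in, int stopType) throws IOException {
        Path file = Files.createTempFile("resultset-", ".bin");
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(file))) {
            spool(in, out, stopType);
        }
        return file;
    }
}
```

Reading the rows back would use the same framing: read a type byte and a
length, then hand the payload to the existing row parser.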

2) Fast first record
Here we need to introduce strategies for "who does the copying, and when"
from (1). I propose pluggable strategies with a few predefined ones (see
below). The user can pass a predefined strategy name or an Executor as a
DataSource parameter, or, where a string is needed (e.g. in the connection
URI), a reference to a static method that returns an Executor. This would
also make it easy to point to Executors.* methods. We may consider
requiring a ScheduledExecutor so that it can also be reused for the
QueryTimeout stuff.
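
As a rough sketch of how a string-valued setting could be turned into an
Executor, the Java snippet below accepts either a predefined name or a
"ClassName#staticMethod" reference. The class ExecutorSpec, the names
"direct" and "cached", and the '#' separator are all assumptions made for
illustration; nothing here is an existing pgjdbc option.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

// Hypothetical resolver (not a pgjdbc API): maps a string to an Executor.
final class ExecutorSpec {

    static Executor resolve(String spec) throws ReflectiveOperationException {
        // Predefined strategy names (illustrative only).
        switch (spec) {
            case "direct":
                return Runnable::run; // run on the calling thread
            case "cached":
                return Executors.newCachedThreadPool();
            default:
                break;
        }
        // Otherwise expect "fully.qualified.Class#staticNoArgMethod".
        int hash = spec.indexOf('#');
        if (hash <= 0 || hash == spec.length() - 1) {
            throw new IllegalArgumentException("Unknown executor spec: " + spec);
        }
        Class<?> owner = Class.forName(spec.substring(0, hash));
        Method factory = owner.getMethod(spec.substring(hash + 1));
        if (!Modifier.isStatic(factory.getModifiers())
                || !Executor.class.isAssignableFrom(factory.getReturnType())) {
            throw new IllegalArgumentException(
                    "Not a static method returning an Executor: " + spec);
        }
        return (Executor) factory.invoke(null);
    }
}
```

With this shape, a value such as
"java.util.concurrent.Executors#newSingleThreadExecutor" would resolve to
the result of that factory method.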

I propose the following predefined strategies:
a) Direct executor, which does all the loading at the very beginning,
potentially saving to a temp file.
b) Postponed executor, which works much like my draft: it reads as needed
without saving to disk, and saves to disk only when the connection is
needed for some other statement.
c) A JVM-wide Executors.newCachedThreadPool that starts offloading in
parallel once fetchSize is reached.
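
The three strategies (a) to (c) could be expressed as plain Executors, as
in the Java sketch below. The class OffloadStrategies, the Postponed class,
and the runPending() hook are hypothetical names; the assumption is that
the driver submits "copy the rest of the result set" as a Runnable and, for
(b), calls runPending() just before the connection is reused. Daemon
threads for (c) are a choice made here so the pool never blocks JVM exit.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

// Hypothetical strategy holder (not pgjdbc code).
final class OffloadStrategies {

    // (a) Direct: the offload task runs immediately on the calling thread.
    static final Executor DIRECT = Runnable::run;

    // (b) Postponed: tasks are held until someone needs the connection.
    static final class Postponed implements Executor {
        private final Queue<Runnable> pending = new ArrayDeque<>();

        @Override
        public synchronized void execute(Runnable task) {
            pending.add(task);
        }

        private synchronized Runnable next() {
            return pending.poll();
        }

        // Runs all queued tasks on the calling thread, e.g. right before
        // another statement is sent on the same connection.
        void runPending() {
            Runnable task;
            while ((task = next()) != null) {
                task.run();
            }
        }
    }

    // (c) JVM-wide cached thread pool, created lazily on first use.
    private static final class PoolHolder {
        static final Executor POOL = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "resultset-offload");
            t.setDaemon(true);
            return t;
        });
    }

    static Executor background() {
        return PoolHolder.POOL;
    }
}
```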

I'd also propose setting the default fetchSize to some reasonable value,
such as 1000, and making one of the strategies (e.g. (a)) the default, so
that we don't get OOMs with default settings. Or we should allow a default
fetch size to be set at the connection level, the data source level, or
both.
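
For comparison, this is roughly what an application has to do by hand
today through the standard JDBC API, which is the step a driver-level
default would make unnecessary. The helper class FetchSizeDefaults is a
hypothetical name; only Statement.getFetchSize and Statement.setFetchSize
are real JDBC calls, and 1000 is simply the value suggested above.

```java
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical application-side helper (not a pgjdbc API).
final class FetchSizeDefaults {

    static final int DEFAULT_FETCH_SIZE = 1000; // value suggested in this message

    // Applies the default only when no fetch size was set explicitly;
    // in JDBC a fetch size of 0 means "let the driver decide".
    static int applyDefault(Statement stmt, int defaultFetchSize) throws SQLException {
        if (stmt.getFetchSize() == 0) {
            stmt.setFetchSize(defaultFetchSize);
        }
        return stmt.getFetchSize();
    }
}
```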

3) Fast cancel/resultset close
This is the only place where switching to portals is needed, as far as I
can see, and it can be done orthogonally to (1) and (2). I don't see any
other goal that would benefit from it. To be honest, I am willing to do (1)
and (2), but not (3), because that would require me to get much deeper into
the protocol, which I know almost nothing about right now.

Best regards, Vitalii Tymchyshyn.
