From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: "Mooney, Ryan" <ryan(dot)mooney(at)pnl(dot)gov>, pgsql-general(at)postgresql(dot)org
Subject: Re: Large DB
Date: 2004-04-02 16:16:21
Message-ID: 28229.1080922581@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:
> The first step, however (acquire_sample_rows() in analyze.c), has to
> read more rows than finally end up in the sample. It visits fewer than
> O(nblocks) pages but certainly more than O(1).
> A vague feeling tries to tell me that the number of page reads is
> somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which
> grow like O(ln(n)).

Good guess. Vitter's paper says the expected time to sample n rows from
a table of size N is O(n * (1 + log(N/n))).
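
For concreteness, here is the base case Vitter improves on, the classic
Algorithm R reservoir sketch (an illustration only, not the analyze.c
code):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/*
 * Algorithm R: maintain a uniform sample of n items from a stream of
 * unknown length.  After i items have been seen, each of them is in
 * the reservoir with probability n/i.
 */
static void
reservoir_sample(const int *stream, int N, int *reservoir, int n)
{
    int     i;

    for (i = 0; i < n; i++)     /* the first n items fill the reservoir */
        reservoir[i] = stream[i];

    for (i = n; i < N; i++)
    {
        int     j = rand() % (i + 1);   /* ~uniform in [0, i]; modulo
                                         * bias ignored for the sketch */

        if (j < n)              /* item i replaces slot j with prob n/(i+1) */
            reservoir[j] = stream[i];
    }
}

int
main(void)
{
    int     data[1000], sample[10], i;

    srand((unsigned) time(NULL));
    for (i = 0; i < 1000; i++)
        data[i] = i;
    reservoir_sample(data, 1000, sample, 10);
    for (i = 0; i < 10; i++)
        printf("%d ", sample[i]);
    printf("\n");
    return 0;
}

This visits every item; Vitter's Algorithm Z reaches the
O(n * (1 + log(N/n))) bound by drawing random skip lengths so that most
items are never examined at all.  For, say, N = 10^9 and n = 30000 that
is on the order of 30000 * (1 + ln(33333)), about 3.4e5 steps instead
of 10^9.
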
> I have an idea how this could be done with O(1) page reads.

The hard part is getting a genuinely random sample when we don't know N
in advance. We do however know the table size in blocks, so if you're
willing to make assumptions about constant tuple density you could do
something different. (But the tuple density assumption is exactly the
weak spot of what we've got, so I'm unconvinced that would be a big step
forward.)
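
To make that "something different" concrete, a hypothetical sketch (my
illustration, not Manfred's actual idea; read_tuple() and both constants
are invented for the example):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NBLOCKS        1000     /* known: table size in blocks */
#define TUPS_PER_PAGE  100      /* the constant tuple density assumption */

/* Stand-in for a real page access; invented for this sketch. */
static int
read_tuple(int blk, int off)
{
    return blk * TUPS_PER_PAGE + off;   /* pretend tuple id */
}

/* One page read per sample row: n reads total, independent of N. */
static void
block_sample(int *sample, int n)
{
    int     i;

    for (i = 0; i < n; i++)
    {
        int     blk = rand() % NBLOCKS;         /* pick a page ... */
        int     off = rand() % TUPS_PER_PAGE;   /* ... then a slot in it */

        sample[i] = read_tuple(blk, off);
    }
}

int
main(void)
{
    int     sample[10], i;

    srand((unsigned) time(NULL));
    block_sample(sample, 10);
    for (i = 0; i < 10; i++)
        printf("%d ", sample[i]);
    printf("\n");
    return 0;
}

If real pages hold different numbers of live tuples, rows on crowded
pages are under-sampled, which is exactly the weak spot noted above.
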
regards, tom lane