Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance

From: Jan Kara <jack(at)suse(dot)cz>
To: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
Cc: Dave Chinner <david(at)fromorbit(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Trond Myklebust <trondmy(at)gmail(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, James Bottomley <James(dot)Bottomley(at)HansenPartnership(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-14 10:00:40
Message-ID: 20140114100040.GB21327@quack.suse.cz
Lists: pgsql-hackers

On Tue 14-01-14 09:08:40, Hannu Krosing wrote:
> >>> Effectively you end up with buffered read/write that's also mapped into
> >>> the page cache. It's a pretty awful way to hack around mmap.
> >> Well, the problem is that you can't really use mmap() for the things we
> >> do. Postgres' durability works by guaranteeing that our journal entries
> >> (called WAL := Write Ahead Log) are written & synced to disk before the
> >> corresponding entries of tables and indexes reach the disk. That also
> >> allows us to group many random writes into a few contiguous writes
> >> fdatasync()ed at once. Only during a checkpointing phase is the bulk of
> >> the data then (slowly, in the background) synced to disk.
> > Which is the exact algorithm most journalling filesystems use for
> > ensuring durability of their metadata updates. Indeed, here's an
> > interesting piece of architecture that you might like to consider:
> >
> > * Neither XFS nor BTRFS use the kernel page cache to back their
> > metadata transaction engines.
> But file system code is supposed to know much more about the
> underlying disk than a mere application program like postgresql.
>
> We do not want to start duplicating OS if we can avoid it.
>
> What we would like is to have a way to tell the kernel
>
> 1) "here is the modified copy of file page, it is now safe to write
> it back" - the current 'lazy' write
>
> 2) "here is the page, write it back now, before returning success
> to me" - unbuffered write or write + sync
>
> but we also would like to have
>
> 3) "here is the page as it is currently on disk, I may need it soon,
> so keep it together with your other clean pages accessed at time X"
> - this is the non-dirtying write discussed
>
> the page may be in buffer cache, in which case just update its LRU
> position (to either current time or time provided by postgresql), or
> it may not be there, in which case put it there if reasonable by its
> LRU position.
>
> And we would like all this to work together with other current linux
> kernel goodness of managing the whole disk-side interaction of
> efficient reading and writing and managing the buffers :)
So when I was speaking about the proposed vrange() syscall in this thread,
I thought that instead of injecting pages into the pagecache for aging as you
describe in 3), you would mark pages as volatile (i.e. available for reclaim
by the kernel) through the vrange() syscall. The next time you need the page,
you check whether the kernel has reclaimed it. If it has, you reload it from
disk; if not, you unmark it and use it.
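
To make that flow concrete, here is a rough sketch of how the application
side could look. vrange() is only a proposal, so everything below is a
simulation: FakeKernel, BufferPool and all method names are hypothetical
stand-ins, not a real kernel or PostgreSQL interface.

```python
from typing import Callable, Dict, Set


class FakeKernel:
    """Stand-in for the kernel side of the proposed vrange() (hypothetical)."""

    def __init__(self) -> None:
        self.volatile: Set[int] = set()  # pages the app said may be dropped
        self.purged: Set[int] = set()    # pages that were actually dropped

    def mark_volatile(self, page_no: int) -> None:
        self.purged.discard(page_no)
        self.volatile.add(page_no)

    def unmark_volatile(self, page_no: int) -> bool:
        """Unmark the page; return True if it was reclaimed in the meantime."""
        self.volatile.discard(page_no)
        was_purged = page_no in self.purged
        self.purged.discard(page_no)
        return was_purged

    def reclaim(self, page_no: int) -> None:
        """Simulate memory pressure hitting one volatile page."""
        if page_no in self.volatile:
            self.volatile.discard(page_no)
            self.purged.add(page_no)


class BufferPool:
    """Application-side cache of clean pages (hypothetical)."""

    def __init__(self, kernel: FakeKernel,
                 read_from_disk: Callable[[int], bytes]) -> None:
        self.kernel = kernel
        self.read_from_disk = read_from_disk
        self.pages: Dict[int, bytes] = {}
        self.disk_reads = 0

    def release(self, page_no: int) -> None:
        """Done with a clean page for now: the kernel may drop it."""
        self.kernel.mark_volatile(page_no)

    def get(self, page_no: int) -> bytes:
        # Page is cached and was not reclaimed: unmark it and use it.
        if page_no in self.pages and not self.kernel.unmark_volatile(page_no):
            return self.pages[page_no]
        # Page was never cached, or the kernel reclaimed it: reload from disk.
        self.pages[page_no] = self.read_from_disk(page_no)
        self.disk_reads += 1
        return self.pages[page_no]
```

The point of the split is that the application never has to guess about
memory pressure: it only learns at unmark time whether its copy is still
valid, and the reload path is the same one it already uses for a cache miss.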

The aging of pages marked as volatile, as it is currently implemented, may
not be perfect for your needs, but you still have time to influence what
gets implemented... Actually, the developers of the vrange() syscall were
specifically looking for ideas on what to base aging on. Currently I
think it is first marked, first evicted.
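
As a toy model of what "first marked, first evicted" would mean in practice,
consider the sketch below. It is not taken from the vrange() patches;
VolatileQueue and its methods are hypothetical, and the choice to move a
re-marked page to the back of the queue is my assumption.

```python
from collections import OrderedDict
from typing import List


class VolatileQueue:
    """Toy model of first-marked, first-evicted aging (hypothetical)."""

    def __init__(self) -> None:
        # Insertion order of the keys is the eviction order.
        self._order: "OrderedDict[int, None]" = OrderedDict()

    def mark(self, page_no: int) -> None:
        # Assumption: re-marking a page moves it to the back of the queue.
        self._order.pop(page_no, None)
        self._order[page_no] = None

    def unmark(self, page_no: int) -> None:
        # An unmarked page is no longer a candidate for eviction.
        self._order.pop(page_no, None)

    def evict(self, count: int) -> List[int]:
        """Drop up to `count` pages, oldest mark first; return their numbers."""
        victims: List[int] = []
        while self._order and len(victims) < count:
            page_no, _ = self._order.popitem(last=False)
            victims.append(page_no)
        return victims
```

With this ordering, a hot page that happens to be marked early is evicted
before a cold page marked later, since only the time of marking counts. An
aging basis that took an application-provided access time into account, as
Hannu suggests in 3), would order the two differently.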

Honza
--
Jan Kara <jack(at)suse(dot)cz>
SUSE Labs, CR
