Re: NTA access on Solaris

From: Sherry Moore <sherry(dot)moore(at)sun(dot)com>
To: Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, LLonergan(at)greenplum(dot)com, markir(at)paradise(dot)net(dot)nz, pavan(at)enterprisedb(dot)com, swm(at)alcove(dot)com(dot)au, pgsql-hackers(at)postgresql(dot)org, drady(at)greenplum(dot)com
Subject: Re: NTA access on Solaris
Date: 2007-03-06 06:50:38
Message-ID: 20070306065038.GA264296@sun.com
Lists: pgsql-hackers

On a 1P system with 512K L2, it is more obvious why we shouldn't
bypass L2 for small reads:

The same readtest as in my previous mail, invoked as follows:

./readtest -s working-set-size -f /platform/i86pc/boot_archive -n 100

With copyout_max_cached being 128K:

Working set   16K   32K   64K   128K  256K  512K  1M    2M    128M
================================================================================
Seconds       4.2   4.0   4.1   4.1   5.7   7.0   7.1   7.0   7.5

With copyout_max_cached being 8K:

Working set   16K   32K   64K   128K  256K  512K  1M    2M    128M
================================================================================
Seconds       4.8   4.8   4.9   4.9   5.0   5.0   5.0   5.0   5.1

Sherry

On Mon, Mar 05, 2007 at 09:41:14PM -0800, Sherry Moore wrote:
> ----- Forwarded message from Sherry Moore <sherry(dot)moore(at)sun(dot)com> -----
>
> Date: Mon, 5 Mar 2007 21:34:19 -0800
> From: Sherry Moore <sherry(dot)moore(at)sun(dot)com>
> To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> Cc: Luke Lonergan <LLonergan(at)greenplum(dot)com>,
> Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>,
> Pavan Deolasee <pavan(at)enterprisedb(dot)com>,
> Gavin Sherry <swm(at)alcove(dot)com(dot)au>,
> PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>,
> Doug Rady <drady(at)greenplum(dot)com>,
> Sherry Moore <sherry(dot)moore(at)sun(dot)com>
> Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant
>
> Hi Tom,
>
> Sorry about the delay. I have been away from computers all day.
>
> In the current Solaris release in development (Code name Nevada,
> available for download at http://opensolaris.org) I have implemented
> non-temporal access (NTA) which bypasses L2 for most writes, and reads
> larger than copyout_max_cached (patchable, defaults to 128K). The block
> size used by Postgres is 8KB. If I patch copyout_max_cached to 4KB to
> trigger NTA for reads, the access times with a 16KB buffer and a 128MB
> buffer are very close.
>
> I wrote readtest to simulate the access pattern of VACUUM (attached).
> tread is a 4-socket dual-core Opteron box.
>
> <81 tread >./readtest -h
> Usage: readtest [-v] [-N] -s <size> -n iter [-d delta] [-c count]
>         -v:             Verbose mode
>         -N:             Normalize results by number of reads
>         -s <size>:      Working set size (may specify K,M,G suffix)
>         -n iter:        Number of test iterations
>         -f filename:    Name of the file to read from
>         -d [+|-]delta:  Distance between subsequent reads
>         -c count:       Number of reads
>         -h:             Print this help
>
> With copyout_max_cached at 128K (in nanoseconds, NTA not triggered):
>
> <82 tread >./readtest -s 16k -f boot_archive
> 46445262
> <83 tread >./readtest -s 128M -f boot_archive
> 118294230
> <84 tread >./readtest -s 16k -f boot_archive -n 100
> 4230210856
> <85 tread >./readtest -s 128M -f boot_archive -n 100
> 6343619546
>
> With copyout_max_cached at 4K (in nanoseconds, NTA triggered):
>
> <89 tread >./readtest -s 16k -f boot_archive
> 43606882
> <90 tread >./readtest -s 128M -f boot_archive
> 100547909
> <91 tread >./readtest -s 16k -f boot_archive -n 100
> 4251823995
> <92 tread >./readtest -s 128M -f boot_archive -n 100
> 4205491984
>
> When the iteration count is 1 (the default), the timing difference
> between a 16k buffer and a 128M buffer is much bigger for both
> copyout_max_cached sizes, mostly due to the cost of TLB misses. When
> the iteration count is larger, most of the page tables are already in
> the Page Descriptor Cache for the later page accesses, so the overhead
> of TLB misses becomes smaller. As you can see, when we do bypass L2, the
> performance with either buffer size is comparable.
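
To make the raw nanosecond figures above easier to compare, here is a small sketch that computes the ratio of the 128M time to the 16k time for the 100-iteration runs. The helper names slowdown and print_ratios are hypothetical and not part of readtest; the constants are copied from the output quoted above.

```c
#include <stdio.h>

/*
 * Ratio of the large-buffer time to the small-buffer time.
 * Hypothetical helper for illustration; not part of readtest.
 */
static double
slowdown(long long large_ns, long long small_ns)
{
    return (double)large_ns / (double)small_ns;
}

/* Print the 128M/16k ratios for the 100-iteration runs quoted above. */
static void
print_ratios(void)
{
    /* copyout_max_cached = 128K (NTA not triggered) */
    printf("128K threshold: %.2f\n", slowdown(6343619546LL, 4230210856LL));
    /* copyout_max_cached = 4K (NTA triggered) */
    printf("4K threshold:   %.2f\n", slowdown(4205491984LL, 4251823995LL));
}
```

With those figures the ratio is about 1.50 when NTA is not triggered and about 0.99 when it is, which is the "comparable" result described above.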
>
> I am sure your next question is why the 128K limitation for reads.
> Here are the main reasons:
>
> - Based on a lot of the benchmarks and workloads I traced, the
>   target buffers of read operations are typically accessed again
>   shortly after the read, while those of writes usually are not.
>   Therefore, the default operation mode is to bypass L2 for writes,
>   but not for reads.
>
> - The Opteron's L1 cache size is 64K. A read larger than 128KB
>   would have displacement-flushed itself anyway, so for large
>   reads, I will also bypass L2. I am working on dynamically
>   setting copyout_max_cached based on the L1 D-cache size on the
>   system.
>
> The above heuristic should have worked well in Luke's test case.
> However, because the reads were done as 16,000 8K reads
> rather than one 128MB read, the NTA code was not triggered.
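
The heuristic described above can be restated as a small predicate. This is only a sketch under stated assumptions, not the Solaris kernel code: the names copy_dir and would_bypass_l2 are hypothetical, and writes are simplified to always bypass L2 even though the mail says "most writes".

```c
#include <stdbool.h>
#include <stddef.h>

/* Direction of the copy, as seen from the system call. */
enum copy_dir { COPY_READ, COPY_WRITE };

/*
 * Hypothetical restatement of the heuristic; not the kernel implementation.
 * Writes are treated as always bypassing L2 (a simplification), and a read
 * bypasses L2 only when that single request exceeds the threshold.
 */
static bool
would_bypass_l2(enum copy_dir dir, size_t request_len, size_t max_cached)
{
    if (dir == COPY_WRITE)
        return true;
    return request_len > max_cached;
}
```

The decision is made per request, so an 8K read against a 128K threshold never qualifies no matter how many of them are issued, while a single 128MB read does. Lowering the threshold to 4K makes each 8K read qualify.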
>
> Since the OS code has to be general enough to handle most
> workloads, we have to pick some defaults that might not work best for
> some specific operations. It is a calculated balance.
>
> Thanks,
> Sherry
>
>
> On Mon, Mar 05, 2007 at 10:58:40PM -0500, Tom Lane wrote:
> > "Luke Lonergan" <LLonergan(at)greenplum(dot)com> writes:
> > > Good info - it's the same in Solaris, the routine is uiomove (Sherry
> > > wrote it).
> >
> > Cool. Maybe Sherry can comment on the question whether it's possible
> > for a large-scale-memcpy to not take a hit on filling a cache line
> > that wasn't previously in cache?
> >
> > I looked a bit at the Linux code that's being used here, but it's all
> > x86_64 assembler which is something I've never studied :-(.
> >
> > regards, tom lane
>
> --
> Sherry Moore, Solaris Kernel Development http://blogs.sun.com/sherrym
>
> #include <stdlib.h>
> #include <stdio.h>
> #include <ctype.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/param.h>
> #include <sys/time.h>
> #include <sys/mman.h>
> #include <errno.h>
> #include <thread.h>
> #include <signal.h>
> #include <strings.h>
> #include <libgen.h>
>
> #define KB(a) ((a) * 1024)
> #define MB(a) (KB(a) * 1024)
>
> static void
> usage(char *s)
> {
>     fprintf(stderr,
>         "Usage: %s [-v] [-N] -s <size> -n iter "
>         "[-d delta] [-c count]\n", s);
>     fprintf(stderr,
>         "\t-v:\t\tVerbose mode\n"
>         "\t-N:\t\tNormalize results by number of reads\n"
>         "\t-s <size>:\tWorking set size (may specify K,M,G suffix)\n"
>         "\t-n iter:\tNumber of test iterations\n"
>         "\t-f filename:\tName of the file to read from\n"
>         "\t-d [+|-]delta:\tDistance between subsequent reads\n"
>         "\t-c count:\tNumber of reads\n"
>         "\t-h:\t\tPrint this help\n");
>     exit(1);
> }
>
> #define ABS(x) ((x) >= 0 ? (x) : -(x))
>
> static void
> format_num(size_t v, size_t *new, char *code)
> {
>     if (v % (1024 * 1024 * 1024) == 0) {
>         *new = v / (1024 * 1024 * 1024);
>         *code = 'G';
>     } else if (v % (1024 * 1024) == 0) {
>         *new = v / (1024 * 1024);
>         *code = 'M';
>     } else if (v % (1024) == 0) {
>         *new = v / (1024);
>         *code = 'K';
>     } else {
>         *new = v;
>         *code = ' ';
>     }
> }
>
> static size_t
> parse_num(char *s)
> {
>     size_t v = 0;
>
>     for (;;) {
>         switch (tolower(*s)) {
>         case '0':
>         case '1':
>         case '2':
>         case '3':
>         case '4':
>         case '5':
>         case '6':
>         case '7':
>         case '8':
>         case '9':
>             v = v * 10 + *s - '0';
>             ++s;
>             continue;
>
>         case 'k':
>             v *= 1024;
>             return (v);
>
>         case 'm':
>             v *= (1024 * 1024);
>             return (v);
>
>         case 'g':
>             v *= (1024 * 1024 * 1024);
>             return (v);
>
>         default:
>             return (v);
>         }
>     }
> }
>
> /*
>  * Create a memory segment with a given pagesize
>  */
> static void *
> create_memory(size_t size, size_t pagesize)
> {
>     caddr_t p;
>
>     /* With MAP_ALIGN, the address argument carries the alignment. */
>     p = mmap((void *)pagesize, size, PROT_WRITE|PROT_READ,
>         MAP_ALIGN|MAP_PRIVATE|MAP_ANON, -1, 0);
>
>     if (p == MAP_FAILED) {
>         char code;
>         size_t out;
>
>         format_num(pagesize, &out, &code);
>         fprintf(stderr, "mmap(%lu%c,", out, code);
>
>         format_num(size, &out, &code);
>         fprintf(stderr, " %lu%c, ...)", out, code);
>
>         perror("failed");
>         exit(1);
>     }
>
>     return (p);
> }
>
>
> int
> main(int argc, char **argv)
> {
>     hrtime_t start, end, total = 0;
>     unsigned int i;
>     unsigned int iterations = 1;
>     size_t pagesize = getpagesize();
>     size_t size = 1024;
>     longlong_t j;
>     longlong_t k;
>     char *table;
>     volatile int value;
>     int c;
>     int verbose = 0;
>     int delta = 1;
>     int normalize = 0;
>     size_t count;
>     size_t count_requested = 0;
>     double normalized;
>     char filename[MAXPATHLEN] = "";
>
>     while ((c = getopt(argc, argv, "Nhvc:d:f:s:n:")) != EOF) {
>         switch (c) {
>         case 'n':
>             iterations = parse_num(optarg);
>             break;
>         case 's':
>             size = parse_num(optarg);
>             break;
>         case 'v':
>             verbose = 1;
>             break;
>         case 'd':
>             delta = atoi(optarg);
>             break;
>         case 'c':
>             count_requested = parse_num(optarg);
>             break;
>         case 'f':
>             /* Bounded copy; strcpy() could overrun filename. */
>             snprintf(filename, sizeof (filename), "%s", optarg);
>             break;
>
>         case 'N':
>             normalize = 1;
>             break;
>         case 'h':
>         default:
>             usage(basename(argv[0]));
>             break;
>         }
>     }
>
>     /* -f is required; filename was previously left uninitialized. */
>     if (filename[0] == '\0')
>         usage(basename(argv[0]));
>
>     if ((size_t)ABS(delta) >= size) {
>         /* Format specifiers now match the argument types. */
>         fprintf(stderr, "delta %d is larger than size %lu\n",
>             ABS(delta), (unsigned long)size);
>         exit(1);
>     }
>
>     count = count_requested ? count_requested : size;
>
>     if (verbose)
>         printf("Creating table of %lu bytes\n", (unsigned long)size);
>
>     table = create_memory(size, pagesize);
>
>
>     for (i = 0; i < iterations; i++) {
>         int n;
>         int offset = 0;
>         int fd = -1;
>
>         if ((fd = open(filename, O_RDONLY)) < 0) {
>             perror("open");
>             exit(1);
>         }
>
>         k = size - 1;
>         start = gethrtime();
>         /* Read the file in 8K chunks, wrapping within the working set. */
>         while ((n = read(fd, &table[offset], KB(8))) > 0) {
>             offset += n;
>             offset %= size;
>         }
>
>         end = gethrtime();
>         total += (end - start);
>         normalized = (double)(end - start) / count;
>         if (verbose) {
>             printf("total time: %lld, normalized time: %g\n",
>                 end - start, normalized);
>         } else if (normalize) {
>             printf("%g\n", normalized);
>         }
>         close(fd);
>     }
>     printf("%lld\n", total);
>     exit(0);
> }
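
readtest times each pass with gethrtime(), which is not available on every platform. For anyone trying the same measurement elsewhere, a possible stand-in is sketched below using POSIX clock_gettime() with CLOCK_MONOTONIC. The name now_ns is hypothetical and not part of readtest, and this assumes the target system provides CLOCK_MONOTONIC.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* expose clock_gettime() under strict modes */
#endif
#include <time.h>

/*
 * Monotonic time in nanoseconds, as a possible substitute for gethrtime().
 * Hypothetical helper; returns -1 if the clock cannot be read.
 */
static long long
now_ns(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return -1;
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

Calling now_ns() before and after the read loop and subtracting gives an elapsed time in the same unit that readtest prints.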
>
>
> ----- End forwarded message -----
>
> --
> Sherry Moore, Solaris Kernel Development http://blogs.sun.com/sherrym

--
Sherry Moore, Solaris Kernel Development http://blogs.sun.com/sherrym
