Re: Reducing tuple overhead

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, hlinnaka <hlinnaka(at)iki(dot)fi>, Bruce Momjian <bruce(at)momjian(dot)us>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: Reducing tuple overhead
Date: 2015-06-07 08:04:55
Message-ID: CAM3SWZS0GyUaiFx97oYJuirmcW1MsojmEAtoEF7WCgxdppNOXg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 30, 2015 at 6:54 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> The other, related problem is that the ordering operator might start
> to return different results than it did at index creation time. For
> example, consider a btree index built on a text column. Now consider
> 'yum update'. glibc gets updated, collation ordering of various
> strings change, and now you've got tuples that are in the "wrong
> place" in the index, because when the index was built, we thought A <
> B, but now we think B < A. You would think the glibc maintainers
> might avoid such changes in minor releases, or that the Red Hat guys
> would avoid packaging and shipping those changes in minor releases,
> but you'd be wrong.
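
To make that concrete, here is a minimal standalone C sketch (not
PostgreSQL source; the locale name and strings are only examples)
showing that the comparison a text btree ultimately rests on comes
from the C library's collation tables, not from anything stored in
the index:

#include <locale.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	/* example locale; any non-C locale illustrates the point */
	if (setlocale(LC_COLLATE, "en_US.UTF-8") == NULL)
	{
		fprintf(stderr, "locale not available\n");
		return 1;
	}

	/*
	 * In the C locale "Banana" sorts before "apple"; under en_US it is
	 * the other way around.  A btree built while the C library said
	 * A < B keeps that physical order, so if a glibc update changes
	 * the answer, existing entries are silently in the "wrong place".
	 */
	printf("strcoll(\"apple\", \"Banana\") = %d\n",
		   strcoll("apple", "Banana"));
	return 0;
}

The sign of that result is whatever the installed collation tables say
today, which need not be what they said when the index was built.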

I would not think that. Unicode Technical Standard #10 states:

"""
Collation order is not fixed.

Over time, collation order will vary: there may be fixes needed as
more information becomes available about languages; there may be new
government or industry standards for the language that require
changes; and finally, new characters added to the Unicode Standard
will interleave with the previously-defined ones. This means that
collations must be carefully versioned.
"""

Also, "Modern B-Tree Techniques" by Goetz Graefe states on page
238:

"""
In many operating systems, appropriate functions are provided to
compute a normalized key from a localized string value, date value, or
time value. This functionality is used, for example, to list files in
a directory as appropriate for the local language. Adding
normalization for numeric data types is relatively straightforward, as
is concatenation of multiple normalized values. Database code must not
rely on such operating system code, however. The problem with relying
on operating systems support for database indexes is the update
frequency. An operating system might update its normalization code due
to an error or extension in the code or in the definition of a local
sort order; it is unacceptable, however, if such an update silently
renders existing large database indexes incorrect.
"""

Unfortunately, it is simply not the case that we can rely on OS
collations being immutable. We have *no* contract with any C standard
library concerning collation stability whatsoever. I'm surprised that
we don't hear more about this kind of corruption.
--
Peter Geoghegan
