Re: "Shared strings"-style table

From: Rob Sargent <robjsargent(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: "Shared strings"-style table
Date: 2017-10-13 16:09:05
Message-ID: dd0a0ae0-c28c-92d7-0650-559a4a43dd7b@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 10/13/2017 09:49 AM, Seamus Abshere wrote:
> hey,
>
> In the spreadsheet world, there is this concept of "shared strings," a
> simple way of compressing spreadsheets when the data is duplicated in
> many cells.
>
> In my database, I have a table with >200 million rows and >300 columns
> (all the households in the United States). For clarity of development
> and debugging, I have not made any effort to normalize its contents, so
> millions of rows have, for example, "SINGLE FAMILY RESIDENCE /
> TOWNHOUSE" (yes, that whole string!) instead of some code representing
> it.
>
> Theoretically / blue sky, could there be a table or column type that
> transparently handles "shared strings" like this, reducing size on disk
> at the cost of lookup overhead for all queries?
>
> (I guess maybe it's like TOAST, but content-hashed and de-duped and not
> only for large objects?)
>
> Thanks,
> Seamus
>
> --
> Seamus Abshere, SCEA
> https://www.faraday.io
> https://github.com/seamusabshere
> https://linkedin.com/in/seamusabshere
>
>
What data type are these columns now? I would be tempted to map the
full strings to an abbreviation just so I didn't have to alter all the
columns to an "id"; Optional to place any RI on the columns to the
abbreviation dictionary table. Just use the translation as a last step
in user facing reports. If you can map/abbreviate to 4 characters,
you've approximated the disk size of an integer.

In response to

Browse pgsql-general by date

  From Date Subject
Next Message David G. Johnston 2017-10-13 16:12:23 Re: "Shared strings"-style table
Previous Message Seamus Abshere 2017-10-13 15:49:21 "Shared strings"-style table