Re: full text search on hstore or json with materialized view?

From: Rj Ewing <ewing(dot)rj(at)gmail(dot)com>
To: George Neuner <gneuner2(at)comcast(dot)net>
Cc: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: full text search on hstore or json with materialized view?
Date: 2017-04-20 15:50:31
Message-ID: CAOSSsV3PQaGPLGnXnv6QL0t4TrSSX_7UOk1OXZbigWy9v6Wx5w@mail.gmail.com
Lists: pgsql-general

On Wed, Apr 19, 2017 at 9:55 PM, George Neuner <gneuner2(at)comcast(dot)net> wrote:

> On Wed, 19 Apr 2017 16:28:13 -0700, Rj Ewing <ewing(dot)rj(at)gmail(dot)com>
> wrote:
>
> >Okay, messing around a bit more with the secondary k:v table, it seems
> >like this could be a good solution.
> >
> >I created a keys table to hold the 63 key values, then I dropped and
> >recreated the secondary table, using a FK referencing the keys table. I'm
> >not really sure why, but a basic full text query on 44 million rows is
> >taking approx. 20 ms.
>
> That pretty much confirms your statistics were bad ... using the FK
> table or not wouldn't make any difference to the planner.
>
> But if you are getting 20ms on 44M rows, then one or more of the
> following must be true:
> - your text values must be very short
> - your FTS queries must be very simple
> - you aren't reading the results
>

Text is on average very short, 1-3 words per value. It was a count(*)
query with only a single condition.
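Something along these lines (a sketch only; 'somevalue' is just a
placeholder, and the table definition is shown further down):

-- single-condition count over the secondary k:v table (placeholder term)
SELECT count(*)
FROM samples_lg_txt
WHERE tsv @@ to_tsquery('somevalue');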

> For comparison:
>
> I have an application that does FTS on a table of NAICS descriptions
> indexed using tsvectors with an average length of 4.8 tokens per. It
> does a 3-part All/Any/None term search.
>
> On my 24-core 2.4GHz server, a single threaded query with the whole
> table and index in memory takes ~1 ms to search 20K rows using a
> realistic tsquery: e.g.,
>
> SELECT code,description
> FROM naics
> WHERE ts_index @@ to_tsquery('packaged & software & !(wholesale)')
>
> [getting the data out of Postgresql takes longer than the search]
>
>
> GIN indexes don't exactly scale linearly, and tsquery is, in general,
> much more dependent on the lengths of the tsvectors than on the
> complexity of the match, but with 44M rows of similarly distributed
> data, a similarly realistic query would be expected to take well over
> 1 second.
>
>
> My example is genuine but too small to bother parallelizing [mentioned
> in a previous message]. Since you are *testing* with 1M records (that
> create 44M k:v shards), I am assuming you will need to deal with much
> more than that in deployment. And if you think you need FTS, then you
> must be expecting more than simple word matches [as below], else you
> might do something simpler like
>
> SELECT ...
> WHERE val ILIKE <value>
>

The 1M records would most likely be the max. On average the tables would
have more like 100,000 records each.

I also realized that I inserted all k:v pairs into the secondary k:v table.
In reality, I would only index strings, which would eliminate approx. 25% of
the k:v pairs, bringing that number down closer to 30M.

From my understanding, *ILIKE* doesn't do any text normalization, which is
something we would like to have.
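A quick illustration of the difference I mean (the word pair is just an
example):

-- ILIKE only matches the literal pattern:
SELECT 'water sampling' ILIKE '%sampled%';        -- false
-- FTS stems both sides, so related word forms match:
SELECT to_tsvector('english', 'water sampling')
       @@ to_tsquery('english', 'sampled');       -- true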

>
>
>
> >my table structure is:
> >
> > Table "public.samples_lg_txt"
> > Column | Type | Modifiers
> >--------+----------+-----------
> > id | integer |
> > key | integer |
> > val | text |
> > tsv | tsvector |
> >Indexes:
> > "idx_tsv_samples_lg_text" gin (tsv)
> >Foreign-key constraints:
> > "samples_lg_txt_id_fkey" FOREIGN KEY (id) REFERENCES samples_lg(id)
> > ON DELETE CASCADE
> > "samples_lg_txt_key_fkey" FOREIGN KEY (key) REFERENCES keys(id)
> >
> >
> >how would I write an AND query that filtered on 2 separate keys from the
> >samples_lg_txt table?
> >
> >something like:
> >
> >SELECT COUNT(*) FROM samples WHERE id IN (
> >  SELECT DISTINCT s.id
> >  FROM samples_lg_txt s
> >  JOIN keys k ON s.key = k.id
> >  WHERE (k.name = 'key1' AND s.tsv @@ to_tsquery('value1'))
> >    AND (k.name = 'key2' AND s.tsv @@ to_tsquery('value2'))
> >);
>
> You're overthinking it.
>
> SELECT count(DISTINCT s.id)
> FROM samples_lg_txt AS s
> JOIN keys AS k ON k.id = s.key
> WHERE (k.name = 'key1' AND s.tsv @@ to_tsquery('value1'))
>    OR (k.name = 'key2' AND s.tsv @@ to_tsquery('value2'))
>
But that is an OR query; I'm trying to do an AND query.
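One way I can think of to express the AND across rows (a sketch, untested,
reusing the names above): group per sample id and require both keys to have
matched.

SELECT count(*)
FROM (
    SELECT s.id
    FROM samples_lg_txt s
    JOIN keys k ON k.id = s.key
    WHERE (k.name = 'key1' AND s.tsv @@ to_tsquery('value1'))
       OR (k.name = 'key2' AND s.tsv @@ to_tsquery('value2'))
    GROUP BY s.id
    HAVING count(DISTINCT k.name) = 2   -- both keys matched for this id
) AS both_matched;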

> There's actually no need to join if you can use the key name instead
> of an integer id. You can FK on strings, so you can still maintain an
> identity table of keys. E.g.,
>
> > id | integer |
> > key | varchar(32) | FK keys(name) ...
> > val | text |
> > tsv | tsvector |
>
>
> Then the query could be just
>
> SELECT count(distinct id)
> FROM samples_lg_txt
> WHERE (key = 'key1' AND tsv @@ to_tsquery('value1'))
>    OR (key = 'key2' AND tsv @@ to_tsquery('value2'))
>
>
> Just a reminder [it's late here 8-)]: FK columns contain values - not
> weird references to the foreign tables. The constraint just enforces
> that any value inserted/updated into the FK column matches an existing
> value in the relevant foreign table.
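
If I follow, a minimal DDL sketch of that layout would be something like
this (purely illustrative; assumes keys.name is made the primary key, with
the width and ON DELETE behavior carried over from the earlier definitions):

CREATE TABLE keys (
    name varchar(32) PRIMARY KEY
);

CREATE TABLE samples_lg_txt (
    id  integer REFERENCES samples_lg(id) ON DELETE CASCADE,
    key varchar(32) REFERENCES keys(name),
    val text,
    tsv tsvector
);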

This would make queries simpler. I guess a disadvantage to using a string
for the key is that the db size would be larger, and thus less likely to
fit the entire table in RAM. If there are only 63 keys across 44M rows, it
seems that storing a smallint would take less space than storing the
string.
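A rough way to see the per-value difference (pg_column_size is built in;
the key name below is made up):

SELECT pg_column_size(42::smallint)              AS smallint_bytes,  -- 2
       pg_column_size('specimen_locality'::text) AS text_bytes;      -- 1-byte header + string length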

I don't really have a need for the identity table of keys. Its only
purpose was to shrink the database size.

Thanks again for the detailed responses!
