ltree and full text search questions

From: "Marc Mamin" <M(dot)Mamin(at)intershop(dot)de>
To: <pgsql-general(at)postgresql(dot)org>
Cc: <teodor(at)sigaev(dot)ru>
Subject: ltree and full text search questions
Date: 2009-03-26 14:57:12
Message-ID: C4DAC901169B624F933534A26ED7DF31010A5093@JENMAIL01.ad.intershop.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


Hello,

(sorry for this long mail)

I have started to evaluate ltree and tsearch (first on Windows with PG
Version 8.3.7) and I would apprecicate some clarification. The first
part deals with ltree where I have some questions, and the second part
is a concept to combine ltree with the full text search capabilities.
Here I'd like to get your opinion whether I'm on the right way or not.
Moreover I will probably need my own parser for the full text search and
I would be very gratefull if anybody could provide me with some C code
example that is similar to my requirement.

and kudos for ltree and the full text search. I'm impressed by their
performances :-)

I would like to use these two functionalities to store and analyze
paths, (the model beyond the data is not a tree, but a dense directed
graphs).

A) ltree
========

A ltree path would look like this: ...2_456.7_3425.1_23.9_231....
whereas each node is a combination of 2 information:
<class_id>_<item_id>

If my idea works well, I may also like to put some more information in
each node which would hence be a micropath of fix length (3 or 4)

With such a model, I could offer global path analysis on the <class>,
and drilldown possibilities on the <items>.

I came to this idea as the documentation of ltree says that a node is a
list of words separated by the '_' character.

a short example:

--drop table paths ;
create table paths (id int, path ltree);
insert into paths(id,path)values(1, '1_11.2_13.3_10.4_13');
insert into paths(id,path)values(2, '1_12.4_15.3_11.4_10.15_14.1_11');
insert into paths(id,path)values(3, '1_11.2_13.3_10.4_10');
insert into paths(id,path)values(4, '1_12.4_15.3_11.3_10.13_14.13_14');
insert into paths(id,path)values(5, '1_11.2_13.3_10.2_13');
insert into paths(id,path)values(6, '1_12.4_15.3_11.1_10.12_14.1_11');
insert into paths(id,path)values(7, '127_1235');

--now I can e.g retrieve all items that have a path from <class 2> to
<class 4> using the prefix matching:

select id from paths where path ~ '*.2_*.*.4_*.*'::lquery -->(1 & 3)
--drop table paths ;

And now my questions and comments:

1)

From the module description, I first though that the '_' character had a
special meaning
but it just seems to be an extra allowed character beside [a-zA-Z0-9].
Am I correct or is '_' defined internally as separator for
indexing/searching ltree data ?

2)
The documentation says that the length of a label path is limited to
65kB. This is the max number of nodes, and not the size of the string.
Moreover, when you try to define a larger path, it will be silently cut
without notice or error (seems that the first 65kB are just dropped, but
I'm not sure about this).

Maybe it would be a good thing to add this to the documentation.

3)

Gist index

create index path_gist_idx on paths using gist (path);
=> ERROR: index row requires 621840 bytes, maximum size is 8191

So it seems that gist indexes can only be used for paths that do not
exceed 8kB in size,
which is much less than the 65kB limit in depth.

Is this correct or am I missing a point ?
(I also have PostGIS 1.3.5 installed. Could this be an issue ?)

B) Full text search
===================

ltree offers a prefix search on the nodes, but no suffix search which I
need to look for given <item_id>.
So my idea is to combine ltree with full text search.

If a node has a fix format like <class>_<item>, I could use full text
search indexing to look for paths containing a given item.

My first step was to check the default parser which works really badly
in this case:

SELECT alias, description, token FROM ts_debug('1_12.4_15.3_11.1_10');

alias description token
----- ----------- -----
uint Unsigned integer 1
blank Space symbols _
float Decimal notation 12.4
blank Space symbols _
float Decimal notation 15.3
blank Space symbols _
float Decimal notation 11.1
blank Space symbols _
uint Unsigned integer 10

So I guess I need my own parser which should return something like:
(this would be a specific parser that had to know the exact pattern and
meaning of a node)

alias description token
----- ----------- -----
class class_id 1
blank Space symbols _
item item_id 12
blank Space symbols .
class class_id 4
blank Space symbols _
item item_id 15
blank Space symbols .
class class_id 3

As already said, I would be very gratefull if anybody could provide me
with some C code example that is similar to my requirement.

Having this, I would then define different search configuration:

CREATE TEXT SEARCH CONFIGURATION ltclass;
CREATE TEXT SEARCH CONFIGURATION ltitem;

ALTER TEXT SEARCH CONFIGURATION ltclass DROP MAPPING FOR ltitem;
ALTER TEXT SEARCH CONFIGURATION ltitem DROP MAPPING FOR ltclass;

and build different search indexes for each kind of data contained in a
node:

CREATE INDEX path_class_ix ON paths USING gin(to_tsvector('ltclass',
ltree2text(path)));
CREATE INDEX path_item_ix ON paths USING gin(to_tsvector('ltitem',
ltree2text(path)));

Finally using both ltree and full text search would provides a very
powerfull framework for path analysis where a node is not only a single
reference,
but a combination of some pieces of information :-)

And a last question: the alias below exists only in the parser.
I can't use it later on to restrict a search to token of a given alias.
Right ?

Thank you for your attention.

Marc Mamin

Browse pgsql-general by date

  From Date Subject
Next Message Albe Laurenz 2009-03-26 15:07:32 Re: Weird encoding behavior
Previous Message Alvaro Herrera 2009-03-26 13:55:53 Re: Weird encoding behavior