Re: Graph datatype addition

From: Jim Nasby <jim(at)nasby(dot)net>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Misa Simic <misa(dot)simic(at)gmail(dot)com>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject: Re: Graph datatype addition
Date: 2013-05-08 19:07:45
Message-ID: 518AA281.5030400@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 5/8/13 1:40 PM, Atri Sharma wrote:
> On Thu, May 2, 2013 at 7:58 AM, Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:
>>
>>
>> Sent from my iPad
>>
>> On 02-May-2013, at 4:33, Misa Simic <misa(dot)simic(at)gmail(dot)com> wrote:
>>
>>
>>
>> On Wednesday, May 1, 2013, Atri Sharma wrote:
>>>
>>> Hi all,
>>>
>>> Please find a probable prototype for the same:
>>>
>>> struct GraphNode
>>> {
>>> Oid NodeOid; // Oid of the row which is the node here. We will
>>> store an identifier to it here rather than the complete row(with data)
>>> itself.
>>> AdjacencyList *list; // Pointer to the node's adjacency list.
>>> };
>>>
>>> struct AdjacencyList
>>> {
>>> Oid[] neighbours_list;
>>> };
>>>
>>> struct AdjacencyList is probably the 'hottest' data structure in our
>>> entire implementation. We can think of making a cache of recently
>>> accessed struct AdjacencyList instances, or the AdjacencyList(s) of
>>> the neighbours of the recently accessed nodes, because, they are most
>>> likely to be accessed in near future. Advice here, please?
>>>
>>> So.
>>>
>>> struct AdjacencyCache
>>> {
>>> Oid[] cache_values;
>>> };
>>>
>>> push and pop functions for AdjacencyCache follow.
>>>
>>> We need a replacement and invalidation algorithm for the cache. I feel
>>> a version of LRU should be good here.
>>>
>>> I have not given a prototype for operations and algorithm implementations.
>>>
>>> I feel,as suggested by Peter and Jaime, we can look at pgRouting code
>>> for algorithm implementations.
>>>
>>> Florian's concerns are mitigated here to some extent,IMO. Since the
>>> nodes and linkings are loosely coupled, and not represented as a
>>> single representation, updating or changing of any part or adding a
>>> new edge is no longer an expensive operation, as it only requires a
>>> lookup of GraphNode and then its AdjacencyList. If we use the cache as
>>> well, it will further reduce the lookup costs.
>>>
>>> I have not yet thought of the user visible layer as suggested by Jim.
>>> Probably. once we are ok with the internal layer, we can move to the
>>> user visible layer.
>>>
>>> Advice/Comments/Feedback please?
>>>
>>
>> Honestly - I think I dont understand proposal...
>>
>> Datatypes - are about values - what will be stored in that column in a
>> table....
>>
>> Datatype - cant have any clue about "rows"
>>
>> How I understand what you described - you can achieve the same with pure SQL
>> - struct are equvalent to graph tables... Instead od Oid column will store
>> PKs of nodes table...
>>
>>
>> Yes, I agree.I need to think more.
>>
>> Let me get back with a deeper proposal.
>>
>> Regards,
>>
>> Atri
>
> Hi all,
>
> In continuation of the above discussion,I have been thinking about the
> design of the data type. I have been thinking on the lines of making a
> multi dimensional data structure for the same:
>
> Specifically, I have been thinking about making multi lists for
> representing data. After some research, I think that the following
> data structure may help:
>
> Each node will be represented as:
>
> [Down Pointer][Atom][Right Pointer]
>
> Suppose, a graph is like(sorry for the bad drawing):
>
> B
> /
> A D
> \ /
> C
> \
> E
>
> can be represented as:
> C's data E's data
> D's data
> ^ ^
> ^
> A's data
> [|][1][---------------------->[|][1][------------------>[|][1][NULL]
> ^ ^
> [|][1][------------------>[|][0][--------------------->[|][1][NULL]
> ^
> B's data
>
>
> Essentially, the Atom flag denotes if the node has any out edges from
> it. If it has no out edge, ATOM is 0 and Down Pointer points to an
> auxiliary structure which holds the node's data(hence, the data can be
> of arbitrary size).
>
> If the node has some out degree, then, those nodes are added to a new
> sublist which starts from the node which spawns those nodes.Node's
> down pointer points to the head of the new sublist.
>
> Essentially, a sublist has all the nodes directly spawning from the
> head node of the sublist.
>
> This approach has multiple advantages in term of memory and
> efficiency. Also, isolating sub graphs based on some criteria is
> pretty efficient, which is good for many analytics based operations.
>
> Access time is restricted as well.
>
> Thoughts/Comments/Feedback please?

Your second drawing didn't really make any sense to me. :(

I do think it would be most productive to focus on what the API for dealing with graph data would look like before trying to handle the storage aspect. The storage is potentially dirt-simple, as others have shown. The only challenge would be efficiency, but it's impossible to discuss efficiency without some clue of how the data will be accessed. Frankly, for the first round of this I think it would be best if the storage really was just some raw tables. Once something is available people will start figuring out how to use it, and where the API needs to be improved.
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Atri Sharma 2013-05-08 19:16:59 Re: Graph datatype addition
Previous Message Atri Sharma 2013-05-08 18:40:48 Re: Graph datatype addition