| From: | Nick Guenther <nguenthe(at)uwaterloo(dot)ca> | 
|---|---|
| To: | pgsql-general(at)postgresql(dot)org | 
| Subject: | Re: Watching Views | 
| Date: | 2014-09-19 02:40:58 | 
| Message-ID: | 20140918224058.56864yjg0a68d1c0@www.nexusmail.uwaterloo.ca | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-general | 
Ah! Your reply was excellent, David. I only found it now, cleaning out  
my inbox. Comments inline!
Quoting David G Johnston <david(dot)g(dot)johnston(at)gmail(dot)com>:
> Nick Guenther wrote
>> As you said, attaching the trigger to a view is useless (for
>> BEFORE/AFTER, which I'm interested in, also only works on statement
>> level changes, which I would rather not have to deal with). I tried
>> attaching my trigger to a materialized view and found that postgres
>> does not support that; as you said, I need to write a trigger on the
>> source to keep the materialized view and the source in sync. But in
>> that case I don't think a materialized view helps me at all, since
>> without triggers on it I can't even have it compute the diffs for me
>> (and I suspect that internally postgres simply truncates the old table
>> and refills it, so there would be nothing to hook anyway).
>>
>> My bottleneck is the size of my database and the network: I want to
>> take slices of a potentially gigabytes-large database and stream them
>> out to multiple clients. Thus I need to stream--not poll--for changes.
>> I think a materialized view would force me to poll for changes, and in
>> that case I would be doing REFRESH MATERIALIZED VIEW on each poll and
>> therefore not winning anything over a regualar view. Ideally, when an
>> update comes in to a parent table I want all views that have sliced it
>> to be notified; I'm interested in doing dataflow* programming,
>> essentially. Is there maybe some sort of extension that does
>> dependency tracking inside of postgres?
>
> While PostgreSQL has materialized view functionality built in the concept is
> general and can be done manually.  Instead of the main table having the link
> to the FIFO I was thinking you would instead replicate record changes to all
> active subset tables and then triggers on those tables would send the
> relevant changes out to the world.
>
> Keep in mind you can attach a where clause to your trigger, and I think you
> can pass in arguments to it as well.  You should have on trigger per view
> attached to the source table - though with good meta-data and some imposed
> limits you can probably pool some clients into the same stream.
I actually ended up independently deciding on using triggers.
I spent a long time agonizing over trying to avoid triggers
because tie me to postgres, but they ended up not being so bad
and I even think that duplicating the work for MySQL at least should
not be that bad.
The hooks I wrote are in python (don't laugh) and are at  
https://github.com/kousu/modex/tree/databased/src/backend/db/replicate.pysql.
All the hooks do are export {INSERT,UPDATE,DELETE}s to json--
So far filtering is only in my frontend--in javascript. I did this because
  http://nytimes.github.io/pourover/
rightly points outthat centralizing database slicing means creating
an artificial bottleneck. If you're curious, what I've got so far is  
at https://github.com/kousu/modex/blob/databased/src/frontend/tables.js
But now I'm musing on how to do the same in the database.
For context, my use case has several game instances in parallel. So,  
each game has many clients which should share a part of the database,  
and each client individually further slices the data depending on what  
its user wants to investigate. Sharing data between the games and then  
having them use their in-javascript triggers to just ignore each other  
is possible--and my current implementation supports that--but it would  
be a lot better and less memory and bandwidth heavy if I could have  
the database do the initial "share a part" slice, and even better  
slices could be deduplicated--that is, if I could  pooling clients  
onto the a single table instead.
Right now, I foresee the final javascript API being something like
```
var DB = Remote_DB.table("buildings")           //create an AST representing
                                 .where(Column("run_id")  // "select  
location, kind, cost from buildings where run_id = $1"
                                                       .eq(this.run_id))
                                 .select("location", "kind", "cost")
DB = DB.execute()  //invoke the AST inside of postgres, which should  
wire up triggers doing dataflow
var points = DB.where(Column("location")  //create an in-javascript  
dataflow node
                         .within((x,y), 500))               // which  
uses triggers (currently
                                                                    //  
PourOver's Events) to watch
                                                                    //  
the source for changes.
var expensive = DB.max("expensive")         // Create another one, which will
                                                                    //  
update simultaneously with its sibling
                                                                    //  
when the source database changes
```
I would like the AST (and postgres hooks that back it) to have all the  
same methods as the javascript-dataflow part so that changing the  
boundary of where the processing is happening is transparent.
I'm not super fluent in Postgres. Would the best way to do this be to  
create throwaway tables named with UUIDs or something? That seems  
inelegant. The other thing I've thought of doing is building a  
dataflow system (necessarily including a novel query language) in  
front of postgres, but that seems even less elegant.
What I'm thinking about is some kind of awesome overarching  
meta-everything stored-procedure dataflow system. Should I be  
surprised that I haven't seen anything like this before? Dataflow and  
databases seem like such a natural fit and postgres is so popular that  
I suspect I've missed something. Though, I suppose, historically, SQL  
has lived in interactive, centralized-mainframe systems.
> Database size is less an issue compared to the change rate of the affected
> table(s).  Triggers let you plug into the change stream.
Can you expand on this point? I think I may not have explained myself clearly.
I am worried about my database size because, right now, when the 1000th
game is played, the users need to download 999 previous complete games.
This won't scale.
> You could even cascade the triggers so less restrictive filters are grouped
> together at one layer and those materialized views then forward to other
> tables with more restrictive filters.
This is a good idea!  Automating figuring out what can be factored out  
to higher levels efficiently seems like a whole research project in  
itself, but I'll leave a giant big TODO about that and credit you for  
the idea.
> If you make these other tables UNLOGGED you should be able to mitigate the
> performance hit somewhat.
This is a really good idea that I did not think about. I only learned  
about UNLOGGED the other day from a different thread. If I'd been  
paying attention to this thread it would have been in immediately!
> The client, not the trigger, should create the FIFO.  If the FIFO is
> unusable the trigger should update a control table and a monitor process
> should remove that trigger at the next scan.  This should then be extended
> to provide control over the addition and removal of viewers and their
> corresponding schematic objects.
That's funny. I sort of did end up doing that, but not for any  
particular reason.
I have a middleware script between javascript and postgres because  
plpython dies once the session dies, and I couldn't figure out how to.
That middleware script creates the FIFO:  
https://github.com/kousu/modex/blob/databased/src/backend/db/replicate.py#L109
And the stored procedure just writes down what it's told:
https://github.com/kousu/modex/blob/databased/src/backend/db/replicate.pysql#L162
The middleware is supposed to delete itself too:  
https://github.com/kousu/modex/blob/databased/src/backend/db/replicate.py#L135
but if it hard-crashes,   
https://github.com/kousu/modex/blob/databased/src/backend/db/replicate.pysql#L72 cleans up in the way that you  
mentioned.
> The view tables also help avoid the problem since then even if a FIFO write
> fails you have an idea of what should have been, but was not, written and
> can cause it to be written later once the client is listening.
I have mulled this but decided it is dangerous. Consider an attack  
like a TCP SYN DDoS: attacker opens M clients but only carries the  
session long enough to know the database has started replicating, and  
then just drops further packets. Then you have M*#(changes) of backlog  
going nowhere. Instead, my middleware script catches new clients up to  
the current state, so that if a client drops it just needs to  
reconnect and wait before getting anything new off the change stream,  
though I'm not 100% sure my logic for that won't break under load
((this code starts  
https://github.com/kousu/modex/blob/databased/src/backend/db/replicate.py#L191  
and I would appreciate anyone telling me if I'm misusing cursors and  
connections))
> Before stock 9.4 triggers are all you got.  9.4 gives logical wal but no
> clue how that works.  There my be solutions via third-party tools like Slony
> but I am not familiar with them but they focus on direct database
> replication.
Yes! I am excited for this! replicate.pysql seems to implement much of  
the same ideas as logical replication, but with a different API.
In theory, I want more than just replication, though. I want  
full-fledged dataflow; e.g. I want, say
CREATE DATAFLOW VIEW AS .....
which is like MATERIALIZED VIEW except instead of REFRESH MATERIALIZED  
VIEW the view depends on triggers that keep it in sync automatically.  
And to be clear, I don't just want filtered replication  
(http://docs.couchdb.org/en/latest/replication/protocol.html?highlight=filtered%20replication#filter-replication) I want to be able to do operations too; consider this dataflow DISTINCT operator that I  
wrote:
https://github.com/kousu/modex/blob/databased/src/frontend/tables.js#L832
But maybe exposing a rich set of operators is another DDoS hole...  
what do you think?
> Mostly theorizing as I've never actually coded this kind of process...so
> some of these ideas may not pan out.
>
> David J.
>
Well, in a year or so I will hopefully have something useful--and in  
C, not python--to share!
-- 
Nick Guenther
4B Stats/CS
University of Waterloo
| From | Date | Subject | |
|---|---|---|---|
| Next Message | cowwoc | 2014-09-19 04:17:26 | 64-bit TXID? | 
| Previous Message | Alvaro Herrera | 2014-09-19 02:37:55 | Re: [GENERAL] pg_multixact issues |