Efficient traversal cost aggregation over all simple paths/cycles


#1

I want to move to a graph analytics platform. I’m trying to pick between mapd, blazegraph, and some of those very cool nosql graph databases like arangodb/neo4j.

I’d like to keep track of various cost functions on all simple paths of a directed graph with fixed topology, in real time. My edge weights are updated ~100 times/second. My graphs have 5-10 nodes and are pretty densely connected: in total, there are anywhere between 5,000 and 20,000 simple paths and cycles on the graph. Right now, I’m using rethinkdb (a nosql database) and python. I precompute all possible simple paths, store them in a hash map, and just brute-force recalculate them every time an edge weight is updated. The hash map points a given edge (whose weight has just been updated) to all the simple paths and cycles it is a part of. Then I go and recalculate all of those.
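To make that concrete, here’s a stripped-down sketch of what I’m doing (plain dicts standing in for rethinkdb, made-up node names, not my actual code):

```python
# Precomputed once, since the topology is fixed: every simple path/cycle
# stored as a tuple of node ids (in reality there are 5,000-20,000 of these).
all_paths = [("A", "B", "C"), ("A", "C"), ("B", "A", "C"), ("C", "A")]

# Index: edge -> every path/cycle that uses it.
paths_by_edge = {}
for path in all_paths:
    for edge in zip(path, path[1:]):
        paths_by_edge.setdefault(edge, []).append(path)

# Current edge weights (updated ~100 times/second in the real system).
weights = {("A", "B"): 1.0, ("B", "C"): 2.0, ("A", "C"): 0.5,
           ("B", "A"): 3.0, ("C", "A"): 4.0}

# Cached cost of every path (a sum cost is shown; products work the same way).
path_cost = {p: sum(weights[e] for e in zip(p, p[1:])) for p in all_paths}

def on_weight_update(edge, new_weight):
    """Brute force: recompute every path that touches the updated edge."""
    weights[edge] = new_weight
    for path in paths_by_edge.get(edge, []):
        path_cost[path] = sum(weights[e] for e in zip(path, path[1:]))
```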

Well, this is very slow and doesn’t scale!

The cost functions themselves are commutative/associative (sums and products), but I’m struggling to understand what kind of algorithm I should use to compute them efficiently in parallel. The gather-apply-scatter approach feels like the right direction, except that I’m not sure how I’d actually use it to keep track of only the simple paths.

Can someone help me understand how this problem has been solved before and how I might find more information about doing this with a GPU graph analytics platform? Is mapd suited to this kind of thing? Is GAS the right approach? How would I google this problem effectively?


#2

Hi - interesting problem. You said your graph has 5-10 nodes - is that the right number? I’m wondering how you get to 5,000 or more distinct paths (including cycles) at that size.

Also, do you need to compute the path costs eagerly? Are these computations dependent on prior state in some way? Is it feasible/acceptable to do it lazily?


#3

Yes, that’s the right number!

The graphs are directed in the sense that there are different edge weights depending on which direction you’re going: I’m thinking of A→B and B→A as separate edges, since there are different weights in each direction.

But technically… I just checked and saw that to really get ~20,000 simple paths/cycles I need 11 nodes. Adding that one node, going from 10 to 11, takes the number of simple paths/cycles from ~10,000 to ~20,000. Just goes to show why a lot of graph problems are NP-complete! By the way, these are subgraphs of a much larger graph, but it feels impossible to tackle the entire graph at once. I’d love to work on as big a subgraph as possible, but you can see how fast the problem scales.
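If anyone wants to sanity-check the blow-up, here’s a quick count with networkx on a random dense digraph as a stand-in topology (my real graph isn’t this uniform, so the exact numbers differ, but the growth is the point):

```python
import networkx as nx
from itertools import permutations

def count_simple_paths_and_cycles(g):
    # Every simple path between every ordered pair of nodes, plus every simple cycle.
    n_paths = sum(1 for s, t in permutations(g.nodes, 2)
                  for _ in nx.all_simple_paths(g, s, t))
    n_cycles = sum(1 for _ in nx.simple_cycles(g))
    return n_paths + n_cycles

for n in (9, 10, 11):
    g = nx.gnp_random_graph(n, 0.5, directed=True, seed=1)
    print(n, "nodes:", count_simple_paths_and_cycles(g))
```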

I was computing them eagerly for a couple of reasons. The main reason was to keep all of these paths indexed and sorted for fast, unlimited querying. The second reason was to be able to record data for time-series analysis, e.g., if I wanted to study the time-evolution of traversal costs for all simple paths from node A to B. Recording a time series of all ~10,000-20,000 simple paths at all times isn’t necessary, but I would want to be able to do it for a little while at a time.

One obvious inefficiency in my current approach seemed to be the redundant computation of every single path from scratch, even though some paths are compositions of others. For example, the cost of A→B→C→D→E is a composition of the costs of A→B→C and C→D→E. So why not compute them smartly? I came up with a way of doing this and it didn’t seem to help one bit, which made me think I really needed to take a step back.
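What I tried was roughly along these lines (a simplified sketch of the general idea, not my exact code): cache the cost of shared sub-paths, build full-path costs out of them, and throw away any cached entry containing an edge whose weight just changed.

```python
# Cached costs of sub-paths, keyed by the tuple of nodes they cover.
subpath_cost = {}

def composed_cost(path, weights):
    """Cost of a path composed from the costs of its two halves.

    Splitting at a shared node means cost(A..M..Z) = cost(A..M) + cost(M..Z);
    any associative cost function (sums, products) composes the same way.
    """
    if len(path) == 2:                      # a single edge: base case
        return weights[(path[0], path[1])]
    if path not in subpath_cost:
        mid = len(path) // 2
        subpath_cost[path] = (composed_cost(path[:mid + 1], weights)
                              + composed_cost(path[mid:], weights))
    return subpath_cost[path]

def invalidate(edge):
    """Drop every cached sub-path that contains the updated edge."""
    u, v = edge
    for sub in list(subpath_cost):
        if any(sub[i] == u and sub[i + 1] == v for i in range(len(sub) - 1)):
            del subpath_cost[sub]
```

My guess, in hindsight, is that the cache invalidation and bookkeeping on every single update ate whatever the shared-sub-path reuse saved at this graph size.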

So I went on the internet and did some searching, and stumbled on this very relevant article:
https://blog.blazegraph.com/?p=628 . It says:

The big graph anti-pattern is “Throw everything into a big graph and then using the same tools that gave us horizontal scaling for other problems: map/reduce and key-value stores.”

That’s exactly what I’ve been doing wrong.

So now I’m re-evaluating my whole approach. While those super cool graph-query databases (especially arangodb) are very appealing to me, this article helps me understand that my main bottleneck is with graph traversal, and those probably aren’t going to help me with that. Specifically, my problem is to traverse simple paths and efficiently collect costs. I have a gtx 1080 sitting around just waiting to be used, and I think it’s the right tool for the job. The question is: how to do it!
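The closest thing I have to a plan is this: since the topology never changes, the set of paths never changes either, so I could bake every path into a fixed padded matrix of edge indices and recompute all path costs as a single vectorized gather-and-reduce whenever a weight changes. A sketch in numpy is below; my assumption (unverified) is that swapping numpy for cupy would put the same thing on the gtx 1080. Whether this is how the GPU graph platforms actually want the problem expressed, I don’t know.

```python
import numpy as np

# Fixed topology: number the directed edges 0..E-1 once.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "A"), ("B", "A")]
edge_index = {e: i for i, e in enumerate(edges)}

# Each precomputed simple path/cycle becomes a row of edge indices,
# padded out to the length of the longest path.
paths = [[("A", "B"), ("B", "C")],
         [("A", "C")],
         [("B", "C"), ("C", "A")]]
max_len = max(len(p) for p in paths)
path_edges = np.zeros((len(paths), max_len), dtype=np.int64)   # padded edge ids
mask = np.zeros((len(paths), max_len), dtype=bool)             # True where real
for row, p in enumerate(paths):
    ids = [edge_index[e] for e in p]
    path_edges[row, :len(ids)] = ids
    mask[row, :len(ids)] = True

weights = np.array([1.0, 2.0, 0.5, 4.0, 3.0])  # current weight of each edge id

def all_path_costs(weights):
    """One gather plus one masked reduction gives every path cost at once.

    A sum cost is shown; for products, multiply with the padding forced to 1.
    """
    gathered = weights[path_edges]           # shape: (n_paths, max_len)
    return np.where(mask, gathered, 0.0).sum(axis=1)

# On an edge-weight update: overwrite one entry of `weights` and recompute.
weights[edge_index[("B", "C")]] = 2.5
costs = all_path_costs(weights)
```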