The Day of Graph Analytics
Note: consider this post as a continuation of the "Making it interactive" post by Orri Erling.
I have now completed the Virtuoso TPC-H work, including scale out. Optimization possibilities extend to infinity but the present level is good enough. TPC-H is the classic of all analytics benchmarks and is difficult enough, I have extensive commentary on this on my blog (In Hoc Signo Vinces series), including experimental results. This is, as it were, the cornerstone of the true science. This is however not the totality of it. From the LDBC angle, we might liken this to the last camp before attempting a mountain peak.
So, we may now seriously turn to graph analytics. The project has enough left to run in order to get a good BI and graph analytics workload. In LDBC in general, as in the following, BI or business intelligence means complex analytical queries. Graph analytics means graph algorithms that are typically done in graph programming frameworks or libraries.
The BI part is like TPC-H, except for adding the following challenges:
- Joins of derived tables with group by, e.g. comparing popularity of items on consecutive time periods.
- Transitive dimensions - A geographical or tag hierarchy cab ve seen as a dimension table. To get the star schema plan with the selective hash join, the count of the transitive traversal of the hierarchy (hash build side) must be correctly guessed.
- Transitivity in fact table, i.e. average length of reply thread. There the cost model must figure that the reply link is much too high cardinality for hash build side, besides a transitive operation is not a good candidate for a build in multiple passes, hence the plan will have to be by index.
- Graph traversal with condition on end point and navigation step. The hierarchical dimensions and reply threads are in fact trees, the social graph is not. Again the system must know some properties of connectedness (in/out degree, count of vertices) to guess a traversal fanout. This dictates the join type in the step (hash or index). An example is a transitive closure with steps satisfying a condition, e.g. all connected persons have a specific clearance.
- Running one query with parameters from different buckets, implying different best plan.
- Data correlations, e.g. high selectivity arising from two interests seldom occurring together, in places where the correct estimation makes the difference between a good and a bad plan.
- Large intermediate results stored in tables, as in materializing complex summaries of data for use in follow up queries.
- More unions and outer joins.
The idea is to cover the base competences the world has come to expect and to build in challenges to last another 10-15 years.
For rules and metric, we can use the TPC-H or TPC-DS ones as a template. The schema may differ from an implementation of the interactive workload, as these things would normally run on different systems anyway.
As another activity that is not directly LDBC, I will do a merge of SNB and Open Street Map. The geolocated things (persons, posts) will get real coordinates from their vicinity and diverse geo analytics will become possible. This is of some significant interest to Geoknow, another FP7 where OpenLink is participating.
Doing the BI mix and even optimizing the interactive part involves some redoing of the present support for transitivity in Virtuoso. The partitioned group by with some custom aggregates is the right tool for the job, with all parallelization, scale-out ettc ready. You see, TPC-H is very useful also in places one does not immediately associate with it.
As a matter of fact, this becomes a BSP (bulk synchronous processing) control structure. Run any number of steps, each item produces results/effects scattered across partitions. The output of the previous is the input of the next. We might say BSP is an attractor or "Platonic" control structure to which certain paths inevitably lead. Last year I did a BSP implementation in SQL, reading and writing tables and using transactions for serializable update of the border. This is possible but will not compete with a memory based framework and not enough of the optimization potential, e.g. message combining, is visible to the engine in this formulation. So, now we will get this right, as suggested.
So, the transitive derived table construct can have pluggable aggregations, e.g. remembering a path, a minimum length or such), reduction like a scalar valued aggregate (min/max), different grouping sets like in a group by with cube or grouping sets, some group-by like reduction for message combining and so forth. If there is a gather phase that is not just the result of the scatter of the previous step, this can be expressed as an arbitrary database query, also cross
partition in a scale-out setting.
The distributed/partitioned group by hash table will be a first class citizen, like a procedure scoped temporary table to facilitate returning multiple results and passing large data between multiple steps with different vertex operations, e.g. forward and backward in betweenness centrality.
This brings us to the graph analytics proper, which is often done in BSP style, e.g. Pregel, Giraph, Signal-Collect, some but not all Green-Marl applications. In fact, a Green-Marl back end for Virtuoso is conceivable, whether one will be made is a different matter.
With BSP in the database engine, a reference implementation of many standard algorithms is readily feasible and performant enough to do reasonable sizing for the workload and to have a metric. This could be edges ir vertices per unit of time, across a mix of algorithms, for example. Some experimentation will be needed. The algorithms themselves may be had from the Green-Marl sample programs or other implementations. Among others, Oracle would presumably agree that this sort of functionality will in time migrate into core database. We will here have a go at this and along the way formulate some benchmark tasks for a graph analytics workload. Whenever feasible, this will derive from existing work such as graphbench.org but will be adapted to the SNB dataset.
The analytics part will be done with more community outreach than the interactive one. I will blog about the business questions, queries and choke points as we go through them. The interested may pitch in as the matter comes up.