Technical documentation, code, features about Graph and RDF benchmarking
The Social Network Benchmark (SNB) consists of a data generator that generates a synthetic social network, used in three workloads: Interactive, Business Intelligence and Graph Analytics. Currently, only the Interactive Workload has been released in draft stage. A preview of the read-only part of the Business Intelligence Workload is also available.
The main SNB components are:
- https://github.com/ldbc/ldbc_snb_docs The SNB benchmark specification document
- https://github.com/ldbc/ldbc_snb_datagen The data generator exploits parallelism via Hadoop, so it can produce very large datasets on cluster hardware. Even if you do not possess a cluster, you can use a local Hadoop installation in pseudo-distributed mode to take advantage of multi-core parallelism on a single computer.
- https://github.com/ldbc/ldbc_driver The query driver is used to generate the Interactive Workload, which consists of concurrent inserts and read queries in a certain mix. This program is parallel, though currently only multi-core; a cluster version will be added later. Generating the inserts in parallel is not trivial, since the graph structure is complex and we do not want to insert, e.g., a post before a user has registered or before two users have become friends (otherwise, referential integrity constraints might be violated). The query driver keeps track of the progress of all its parallel clients generating inserts and synchronizes them where necessary.
- https://github.com/ldbc/ldbc_snb_interactive_vendors Vendor-specific driver implementations for the Interactive Workload, provided as examples.
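The synchronization problem described for the query driver can be illustrated with a minimal Python sketch. It is not the driver's actual implementation: the names `DependencyTracker` and `run_inserts`, the event identifiers, and the representation of events as `(event_id, dependencies)` pairs are all assumptions made for illustration. The sketch also assumes that events arrive in timestamp order and that every dependency refers to an earlier event in the list.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class DependencyTracker:
    """Hypothetical helper: records which inserts have completed."""

    def __init__(self):
        self._done = set()
        self._cond = threading.Condition()

    def wait_for(self, deps):
        # Block until every event this insert depends on has completed.
        with self._cond:
            self._cond.wait_for(lambda: all(d in self._done for d in deps))

    def mark_done(self, event_id):
        # Publish completion and wake up any waiting clients.
        with self._cond:
            self._done.add(event_id)
            self._cond.notify_all()


def run_inserts(events, num_workers=4):
    """Apply inserts in parallel while respecting dependencies.

    events: list of (event_id, deps) in timestamp order; each dependency
    must appear earlier in the list (assumption of this sketch).
    Returns the order in which the inserts were applied.
    """
    tracker = DependencyTracker()
    applied = []
    applied_lock = threading.Lock()

    def apply(event):
        event_id, deps = event
        tracker.wait_for(deps)          # e.g. both persons exist before "knows"
        with applied_lock:
            applied.append(event_id)    # stand-in for the real database insert
        tracker.mark_done(event_id)

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        list(pool.map(apply, events))   # list() surfaces worker exceptions
    return applied


events = [
    ("person:1", []),
    ("person:2", []),
    ("knows:1-2", ["person:1", "person:2"]),
    ("post:1", ["person:1"]),
]
order = run_inserts(events)
```

Independent inserts (the two persons) may run in any order, while the friendship edge and the post are held back until the entities they reference exist.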
The SNB data generator is a further development of the S3G2 correlated graph generator. It generates a social network with power-law structure that additionally has correlations between attribute values and correlations between structure and values. An example of the former is that people from a certain country have a distribution of first and last names in which names typical for that country are more prevalent. An example of the latter is that people who studied at the same university or share an interest are more likely to be friends. Most of the data volume in the social network is not in the friends graph, but in the posts. These posts contain plausible topic-centered textual data taken from DBpedia: the participants in a discussion effectively read DBpedia pages to each other, paragraph by paragraph. The topics of a discussion are skewed towards the interests of the forum owner (hence also correlated). In all, this data generator is state-of-the-art and has been used, e.g., in the SIGMOD 2014 programming contest, which focused on graph analytics.
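The two kinds of correlation can be illustrated with a small Python sketch. This is a simplified illustration, not the algorithm used by the SNB data generator or S3G2; the name tables, weights, and the `base` and `boost` parameters are invented for the example.

```python
import random

# Hypothetical per-country name frequencies (value/value correlation):
# names typical for a country get a higher weight there.
FIRST_NAMES = {
    "Germany": (["Hans", "Karl", "Anna", "John"], [5, 4, 4, 1]),
    "China": (["Wei", "Li", "Fang", "John"], [5, 4, 4, 1]),
}


def sample_first_name(country, rng):
    """Draw a first name from the country-specific distribution."""
    names, weights = FIRST_NAMES[country]
    return rng.choices(names, weights=weights, k=1)[0]


def friendship_probability(a, b, base=0.01, boost=5.0):
    """Structure/value correlation: a shared university or a shared
    interest raises the chance of a friendship edge (assumed factors)."""
    p = base
    if a["university"] == b["university"]:
        p *= boost
    if set(a["interests"]) & set(b["interests"]):
        p *= boost
    return min(p, 1.0)


rng = random.Random(42)
german_names = [sample_first_name("Germany", rng) for _ in range(1000)]

alice = {"university": "TU Munich", "interests": ["chess", "jazz"]}
bob = {"university": "TU Munich", "interests": ["jazz"]}
carol = {"university": "Tsinghua", "interests": ["hiking"]}

p_similar = friendship_probability(alice, bob)
p_dissimilar = friendship_probability(alice, carol)
```

With these weights, "Hans" is drawn far more often than "John" for Germany, and `p_similar` is larger than `p_dissimilar`.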
After data generation, two post-processing steps are performed:
- splitting the generated data at a point in time: all interactions that took place before that time will be bulk-loaded, whereas everything after it will be inserted online as part of the Interactive Workload. The query driver inserts the events in parallel, yet ensures referential integrity, i.e. before adding a friendship edge, it ensures that the friend already exists in the network. Note that it is not trivial to guarantee this with parallel client sessions and a complex graph structure (one cannot partition the workload among the clients without still having dependencies).
- analyzing the generated data in order to derive parameter bindings for the read queries. In order to produce comparable query plans and query runtimes for the same query with different parameters, we actively look for parameters that lead to similarly sized (intermediate) query results. This step is necessary to keep query behavior understandable even though the graph structure of the SNB is complex and irregular and the data distributions are skewed and correlated.
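The two post-processing steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual SNB tooling: the function names, the `(timestamp, payload)` event format, and the precomputed `result_sizes` table are hypothetical. The sketch assigns events that fall exactly on the split time to the online part, and it selects parameters by taking the group of `k` candidates whose estimated result sizes are closest together.

```python
def split_at_timepoint(events, split_time):
    """Split a list of (timestamp, payload) events into a bulk-load part
    and an online-insert part. Events at exactly split_time go online
    (an assumption of this sketch)."""
    bulk = [e for e in events if e[0] < split_time]
    online = [e for e in events if e[0] >= split_time]
    return bulk, online


def curate_parameters(result_sizes, k):
    """Pick k parameter values with the most similar result sizes.

    result_sizes: dict mapping a candidate parameter to the estimated
    size of the (intermediate) result it produces.
    """
    if not 1 <= k <= len(result_sizes):
        raise ValueError("k must be between 1 and the number of candidates")
    # Sort by size, then slide a window of k candidates over the sorted
    # list and keep the window with the smallest size spread.
    ranked = sorted(result_sizes.items(), key=lambda kv: (kv[1], kv[0]))
    best_start = min(
        range(len(ranked) - k + 1),
        key=lambda i: ranked[i + k - 1][1] - ranked[i][1],
    )
    return [param for param, _ in ranked[best_start:best_start + k]]


events = [(1, "person:1"), (5, "knows:1-2"), (9, "post:1")]
bulk, online = split_at_timepoint(events, split_time=5)

sizes = {"p1": 10, "p2": 5000, "p3": 12, "p4": 11, "p5": 300}
chosen = curate_parameters(sizes, k=3)
```

In this example the outliers `p2` and `p5` are rejected, so every selected binding would touch a comparable amount of data.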