Technical documentation, code, features about Graph and RDF benchmarking
The Semantic Publishing Benchmark v2.0 (SPB) is an LDBC benchmark for RDF database engines inspired by the Media/Publishing industry, particularly by the BBC's Dynamic Semantic Publishing approach.
The application scenario considers a media or publishing organization that deals with a large volume of streaming content, namely news, articles or "media assets". This content is enriched with metadata that describes it and links it to reference knowledge – taxonomies and databases that include relevant concepts, entities and factual information. This metadata allows publishers to efficiently retrieve relevant content according to their various business models. For instance, some, like the BBC, can use it to maintain a rich and interactive web presence for their content, while others, e.g. news agencies, can provide better-defined content feeds.
From a technology standpoint, the benchmark assumes that an RDF database is used to store both the reference knowledge (mostly static) and the metadata (which grows constantly, to stay in sync with the inflow of streaming content). The main interactions with the repository are (i) updates, which add new metadata or alter it, and (ii) queries, which retrieve content according to various criteria.
New features of SPB 2.0:
- Larger sizes of Reference Data - added reference data entities from DBpedia (Companies, Events, Persons)
- Larger number of GeoNames locations - added GeoNames IDs of locations across Europe
- Added owl:sameAs mappings between GeoNames IDs and DBpedia locations
- Added two new queries to the basic interactive query-mix, querying the relations between entities in the reference data
- Requires inference support for RDFS (subPropertyOf, subClassOf) and OWL (TransitiveProperty, SymmetricProperty, sameAs)
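To give a feel for the kind of entailment the last item requires, here is a minimal forward-chaining sketch for owl:sameAs, which is both symmetric and transitive. This is illustrative only; a real RDF store uses a far more general rule engine, and the example IRIs below are made up.

```python
# Naive fixpoint computation of the symmetric-transitive closure of
# owl:sameAs statements. Triples are plain (subject, predicate, object)
# tuples; the geo:/dbpedia: identifiers are invented for illustration.

SAME_AS = "owl:sameAs"

def same_as_closure(triples):
    """Return all (a, b) pairs entailed by the owl:sameAs statements."""
    closed = {(s, o) for s, p, o in triples if p == SAME_AS}
    changed = True
    while changed:
        changed = False
        new = set()
        # symmetry: (a sameAs b) entails (b sameAs a)
        for a, b in closed:
            if (b, a) not in closed:
                new.add((b, a))
        # transitivity: (a sameAs b), (b sameAs c) entails (a sameAs c)
        for a, b in closed:
            for b2, c in closed:
                if b == b2 and a != c and (a, c) not in closed:
                    new.add((a, c))
        if new:
            closed |= new
            changed = True
    return closed

triples = [
    ("geo:2950159", SAME_AS, "dbpedia:Berlin"),
    ("dbpedia:Berlin", SAME_AS, "dbpedia:Berlin,_Germany"),
]
inferred = same_as_closure(triples)
```

With the two statements above, the closure also contains the reversed pairs and the transitive link between the GeoNames ID and the second DBpedia IRI, which is exactly what the new owl:sameAs mappings between GeoNames and DBpedia locations rely on.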
SPB consists of a Data Generator for producing synthetic data, a Query Driver offering two workloads (basic and advanced), and a set of real reference knowledge data and ontologies provided by the BBC, DBpedia and GeoNames.
Main components of the benchmark software are:
- https://github.com/ldbc/ldbc_spb_bm_2.0/blob/master/doc - the SPB documentation, including a Full Disclosure Report template
- https://github.com/ldbc/ldbc_spb_bm_2.0 - contains, as one integral unit, all necessary components of the benchmark software:
- Data generator, which can produce consistent data in parallel and at different scales, allowing experimentation with various dataset sizes
- Query driver, which executes both workloads:
- basic - an interactive query-mix for evaluating RDF systems in the most common use cases
- advanced - interactive and analytical query-mixes that add complexity to the query workload, e.g. faceted, analytical and drill-down queries
- Reference datasets and ontologies - a set of reference data and ontologies provided by the BBC, DBpedia and GeoNames, used in the process of generating the synthetic data
- Result validation - used to validate query results from both workloads
The SPB data generator produces synthetic data that scales in size. The synthetic data consists of a large number of annotations of media assets that refer to entities found in the reference datasets. An annotation (also called a creative work) is metadata about one or more real entities and consists of various properties, e.g. description, date of creation, tagged entities. The data generator models three types of relations in the produced synthetic data:
- Clustering of data: the clustering effect is produced by generating creative works about a single entity over a period of time. The number of creative works starts at a high peak and follows a smooth decay
- Correlations of entities: correlations are produced by generating creative works about two or three entities from the reference data over a period of time. Each entity is tagged by creative works on its own at the beginning and end of the correlation period, while in the middle of the period all entities are tagged together
- Random tagging of entities: a random distribution of tagged entities is created, simulating random 'noise' in the generated data
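The clustering effect described above, a high initial peak followed by a smooth decay, can be sketched with a simple exponential model. The decay rate and the formula itself are assumptions for illustration, not SPB's actual distribution.

```python
import math

def clustered_tag_counts(days, peak, decay=0.35):
    """Number of creative works tagging a single entity per day:
    a high peak on day 0 followed by a smooth exponential decay.
    The exponential shape and decay constant are illustrative only."""
    return [max(1, round(peak * math.exp(-decay * d))) for d in range(days)]

# Example: an entity is tagged by 100 creative works on the first day,
# with interest tailing off over the next days.
counts = clustered_tag_counts(days=10, peak=100)
```

A correlation between two entities could be modeled the same way, with each entity's counts offset in time so the two curves overlap only in the middle of the period.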
The SPB data generator can generate data at various scales defined by the benchmark user, from 1M triples to billions. Generated data is saved to files in a standard RDF serialization format and split into chunks whose size is also defined by the user.
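Chunked output of the kind described above can be sketched as follows; the N-Triples serialization is real, but the file-naming scheme and the in-memory return value are assumptions made for this example, not the generator's actual layout.

```python
# Illustrative sketch: serialize triples as N-Triples lines and split the
# output into user-defined chunks. Returns (file name, file contents) pairs
# instead of writing to disk, so the behavior is easy to inspect.

def write_in_chunks(triples, chunk_size, prefix="generated-data"):
    files = []
    for i in range(0, len(triples), chunk_size):
        name = f"{prefix}-{i // chunk_size:04d}.nt"
        lines = [f"{s} {p} {o} ." for s, p, o in triples[i:i + chunk_size]]
        files.append((name, "\n".join(lines)))
    return files

# Five triples split into chunks of two yield three files.
triples = [(f"<urn:cw:{i}>", "<urn:about>", "<urn:entity:x>") for i in range(5)]
files = write_in_chunks(triples, chunk_size=2)
```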
Generated synthetic data can be loaded into the benchmarked RDF system either by the test driver or manually. Once the data is loaded, various statistics about it are analyzed and query substitution parameters are generated (and saved to files).
The SPB query driver starts the workloads by executing two types of agents simultaneously: editorial agents (executing insert/update/delete operations) and aggregation agents (executing select/construct/describe operations). All agents run in parallel, simulating realistic multi-user use of the RDF system under test.
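The two agent types running in parallel can be sketched with plain threads against a toy store. The store, agent bodies and operation counts are all invented for illustration; the real driver issues SPARQL operations against the system under test.

```python
import threading

class TinyStore:
    """A toy thread-safe triple store standing in for the RDF system under test."""
    def __init__(self):
        self._triples = set()
        self._lock = threading.Lock()

    def insert(self, triple):
        with self._lock:
            self._triples.add(triple)

    def count(self):
        with self._lock:
            return len(self._triples)

def editorial_agent(store, n):
    # Stands in for insert/update/delete operations.
    for i in range(n):
        store.insert((f"cw:{i}", "about", "entity:x"))

def aggregation_agent(store, results, n):
    # Stands in for select/construct/describe operations; here it just
    # samples the store size while the editorial agent is inserting.
    for _ in range(n):
        results.append(store.count())

store = TinyStore()
results = []
threads = [
    threading.Thread(target=editorial_agent, args=(store, 100)),
    threading.Thread(target=aggregation_agent, args=(store, results, 50)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because both agents run concurrently, the aggregation agent observes intermediate store sizes, which is the point of the mixed read/write workload.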