Elements of instance matching benchmarks: a short overview
The number of datasets published in the Web of Data as part of the Linked Data Cloud is constantly increasing. The Linked Data paradigm is based on the unconstrained publication of information by different publishers, and on the interlinking of web resources through “same-as” links, which specify that two URIs correspond to the same real-world object. In the vast number of data sources participating in the Linked Data Cloud, this information is not explicitly stated but is discovered using instance matching techniques and tools. Instance matching is also known as record linkage, duplicate detection, entity resolution and object identification.
For instance, a search in Geonames (http://www.geonames.org/) for "Athens" returns a resource (i.e., a URI) accompanied by a map of the area and information about the place; additional information about the city of Athens can be found in other datasets, such as DBpedia (http://dbpedia.org/) or Open Government Datasets (http://data.gov.gr/). To obtain and exploit all the necessary information about the city of Athens, we need to establish that the retrieved resources refer to the same real-world object.
Web resources are published by "autonomous agents" who choose their preferred information representation or the one that best fits the application of interest. Furthermore, different representations of the same real-world entity arise from data acquisition errors or from the different acquisition techniques used to process scientific data. Moreover, real-world entities evolve and change over time, and sources need to keep track of these developments, a task that is very hard and often not possible. Finally, when integrating data from multiple sources, the process itself may add new erroneous data. Clearly, these problems did not first arise in the era of Web Data; it is thus not surprising that instance matching systems have been around for several years.
It is nevertheless essential at this point to develop, along with instance and entity matching systems, instance matching benchmarks that determine the weak and strong points of those systems, as well as their overall quality, in order to help users decide which system to use for their needs. Hence, well-defined, good-quality benchmarks are important for comparing the performance of instance matching systems, whether already available or still under development. Benchmarks are used not only to inform users of the strengths and weaknesses of systems, but also to motivate developers, researchers and technology vendors to address the weak points of their systems and to improve their performance and functionality. They are also useful for identifying the settings in which each system performs best. Benchmarking aims at providing an objective basis for such assessments.
An instance matching benchmark for Linked Data consists of three elements: a source and a target dataset implementing a set of test cases, where each test case addresses a different kind of requirement regarding instance matching; a ground truth, or gold standard; and finally the evaluation metrics used to assess the benchmarked systems.
Datasets are the raw material of a benchmark. A benchmark comprises a source and a target dataset, and the objective of an instance matching system is to discover the matches between the two. Datasets are characterized by (a) their nature (real or synthetic), (b) the schemas/ontologies they use, (c) their domains, (d) the languages they are written in, and (e) the variations/heterogeneities of the datasets. Real datasets are widely used in benchmarks since they offer realistic conditions for heterogeneity problems and have realistic distributions. Synthetic datasets are generated using automated data generators and are useful because they offer fully controlled test conditions, have accurate gold standards and allow setting the focus on specific types of heterogeneity problems in a systematic manner.
Datasets (and benchmarks) may contain different kinds of variations that correspond to different test cases. According to Ferrara et al., three kinds of variations exist for Linked Data, namely data variations, structural variations and logical variations. The first refers mainly to differences due to typographical errors, differences in the employed data formats, language, etc. The second refers to differences in the structure of the employed Linked Data schemas. Finally, the third derives from the use of semantically rich RDF and OWL constructs that enable one to define hierarchies and equivalence of classes and properties, (in)equality of instances, and complex class definitions through union and intersection, among others.
The common case in real benchmarks is that the datasets to be matched contain different kinds (combinations) of variations. On the other hand, synthetic datasets may be purposefully designed to contain specific types (or combinations) of variations (e.g., only structural), or may be more general in an effort to illustrate all the common cases of discrepancies that appear in reality between individual descriptions.
The gold standard is considered as the “correct answer sheet” of the benchmark, and is used to judge the completeness and soundness of the result sets of the benchmarked systems. For instance matching benchmarks employing synthetic datasets, the gold standard is always automatically generated, as the errors (variations) that are added into the datasets are known and systematically created. When it comes to real datasets, the gold standard can be either manually curated or (semi-) automatically generated. In the first case, domain experts manually mark the matches between the datasets, whereas in the second, supervised and crowdsourcing techniques aid the process of finding the matches, a process that is often time consuming and error prone.
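As a minimal sketch of why the gold standard comes for free with synthetic data, the following Python example derives a target dataset from a source dataset by injecting one kind of data variation (a typographical error) and records each source/target pair while doing so. The function names, the `example.org` URIs and the dictionary-based dataset representation are assumptions made for illustration only; they are not taken from any particular benchmark generator.

```python
import random


def introduce_typo(value, rng):
    """Swap two adjacent characters to simulate a typographical error."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    return value[:i] + value[i + 1] + value[i] + value[i + 2:]


def generate_target(source, seed=0):
    """Copy every source instance under a new URI with a perturbed label.

    Because the generator itself creates each variation, it can record
    the matching (source URI, target URI) pair at the same time.
    """
    rng = random.Random(seed)
    target = {}
    gold_standard = set()
    for n, (source_uri, label) in enumerate(sorted(source.items())):
        target_uri = f"http://example.org/target/{n}"  # hypothetical URI scheme
        target[target_uri] = introduce_typo(label, rng)
        gold_standard.add((source_uri, target_uri))
    return target, gold_standard


# Hypothetical two-instance source dataset: URI -> label.
source = {
    "http://example.org/source/athens": "Athens",
    "http://example.org/source/heraklion": "Heraklion",
}
target, gold_standard = generate_target(source)
```

A real generator would combine many kinds of data, structural and logical variations; the point here is only that the matches are known by construction rather than discovered afterwards.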
Last, an instance matching benchmark uses evaluation metrics to determine and assess the quality and performance of the systems’ output. For instance matching tools, runtime performance is not the critical aspect; what matters most is that a tool returns all and only the correct matches, rather than returning them quickly. For this reason, the evaluation metrics predominantly employed in instance matching benchmarks are the standard precision, recall and F-measure metrics.
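To make these metrics concrete, the sketch below computes them by comparing the set of matches returned by a system against the gold standard. Precision is the fraction of returned matches that are correct, recall is the fraction of gold standard matches that were returned, and the F-measure is taken here to be their harmonic mean (the balanced F1 variant, which is an assumption; the text does not say which weighting benchmarks use). The function name and the pair-of-identifiers representation are likewise illustrative.

```python
def evaluate(found, gold_standard):
    """Return (precision, recall, f_measure) for a set of proposed matches."""
    found = set(found)
    gold_standard = set(gold_standard)
    correct = len(found & gold_standard)

    # Guard against empty sets to avoid division by zero.
    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold_standard) if gold_standard else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: four true matches, three proposed, two of them correct.
gold = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
found = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
precision, recall, f_measure = evaluate(found, gold)
# precision = 2/3, recall = 2/4, f_measure = 4/7
```

In this example the system is reasonably sound (two of its three answers are right) but incomplete (it finds only half of the true matches), and the F-measure summarizes both aspects in a single number.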
Li, C., Jin, L., and Mehrotra, S. (2006) Supporting efficient record linkage for large data sets using mapping techniques. WWW 2006.
Dragisic, Z., Eckert, K., Euzenat, J., Faria, D., Ferrara, A., Granada, R., Ivanova, V., Jimenez-Ruiz, E., Oskar Kempf, A., Lambrix, P., Montanelli, S., Paulheim, H., Ritze, D., Shvaiko, P., Solimando, A., Trojahn, C., Zamazal, O., and Cuenca Grau, B. (2014) Results of the Ontology Alignment Evaluation Initiative 2014. Proc. 9th ISWC workshop on ontology matching (OM 2014).
Bhattacharya, I. and Getoor, L. (2006) Entity resolution in graphs. Mining Graph Data. Wiley and Sons 2006.
Noessner, J., Niepert, M., Meilicke, C., and Stuckenschmidt, H. (2010) Leveraging Terminological Structure for Object Reconciliation. In ESWC 2010.
Flouris, G., Manakanatas, D., Kondylakis, H., Plexousakis, D., and Antoniou, G. (2008) Ontology Change: Classification and Survey. Knowledge Engineering Review (KER 2008), pages 117-152.
Ferrara, A., Lorusso, D., Montanelli, S., and Varese, G. (2008) Towards a Benchmark for Instance Matching. Proc. 3rd ISWC workshop on ontology matching (OM 2008).
Ferrara, A., Montanelli, S., Noessner, J., and Stuckenschmidt, H. (2011) Benchmarking Matching Applications on the Semantic Web. In ESWC 2011.