8th TUC Meeting - Weining Qian (ECNU). On Statistical Characteristics of Real-Life Knowledge Graphs.
Weining Qian, professor at East China Normal University, presented his talk on Statistical Characteristics of Real-Life Knowledge Graphs during the 8th TUC Meeting, held at Oracle’s facilities in Redwood City, California.
Qian explained that the term knowledge graph was introduced by Google in 2012 and that it represents an evolution of the Semantic Web. He then introduced the main questions of his talk: how can we manage knowledge graphs efficiently, and are the existing benchmarks sufficient to test them, given that most of these benchmarks focus only on social networks?
To address this, he drew a clear distinction between social networks and knowledge graphs (KGs). The main differences are that KGs carry a larger number of semantic labels on entities and relations, are topic sensitive, and are hard to define under a unified schema. Hence the motivation for his study: to gain a better understanding of KGs, to help select seed datasets for benchmarks, and to help the development of data generators.
Qian and his team tested 4 different real-life knowledge graphs:
- YAGO2, a huge semantic knowledge graph based on WordNet, Wikipedia and GeoNames that contains >10M entities and >120M facts. To conduct the study they separated YAGO2 into 3 sub-graphs (YagoTax, YagoFact and YagoWiki).
- WordNet, a lexical network for the English language. It contains 98000 entities and 154000 relationships, where a synonym is a node and a semantic relation is an edge.
- DBpedia, similar to YAGO2’s Fact subgraph. It contains 4.58M things and 2795 different kinds of properties.
- Enterprise Knowledge Graph (EKG). It describes enterprise relationships in Chinese and was extracted from reports from enterprises on the Shanghai Stock Market. It is domain specific and contains 51853 entities and 430973 relationships of 7 types.
They also included in the study 2 Social Networks:
- SNRand. The team randomly selected 0.2M users and 5M fellowship relations between them.
- SNRank. Here they selected the 0.2M most active users and >36M fellowship relations between them.
The study concludes that both triple stores and relational databases have their place in benchmarking KGs; the key, however, is to avoid joins over power-law distributed data.
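One way to see why joins over skewed data can be costly is to count the rows a self-join produces. The sketch below is only an illustration and is not taken from the talk: the toy edge lists and the helper `self_join_size` are hypothetical. It relies on the fact that joining an edge list with itself on the source node yields one row per pair of edges sharing that source, so the result size is the sum of squared out-degrees.

```python
from collections import Counter


def self_join_size(edges):
    """Rows produced by joining an edge list with itself on the source node."""
    # Each source with out-degree d contributes d * d (edge, edge) pairs.
    out_degree = Counter(src for src, _ in edges)
    return sum(d * d for d in out_degree.values())


# Two toy edge lists (hypothetical data): 12 edges over 4 source nodes each.
# In `uniform` every source has 3 edges; in `skewed` one hub has 9 of the 12.
uniform = [(s, f"t{i}") for s in "abcd" for i in range(3)]
skewed = [("a", f"t{i}") for i in range(9)] + [(s, "t0") for s in "bcd"]

uniform_rows = self_join_size(uniform)  # 4 * 3**2 = 36
skewed_rows = self_join_size(skewed)    # 9**2 + 3 * 1**2 = 84

print(uniform_rows, skewed_rows)
```

With the same number of edges, the skewed list produces more than twice as many join rows, and the hub alone accounts for 81 of the 84. In a real power-law graph the largest hubs are far more extreme, so the effect grows accordingly.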
The 9th TUC Meeting will be held at SAP's HQ in Walldorf, Germany, on the 9th and 10th of February. Start planning your attendance!
As always, the whole presentation and his slides are shown below: