Master MI, UJF/UFRIMA2G INP/Ensimag, student project management

Title: Link key extraction under ontological constraints

Topic offered in: M2 MOSIG, Project --- M2 MSIAM, Project --- M2R Informatique, Project

Supervisor(s):

Jérôme David (jerome.david@inrialpes.fr) LIG-INRIA-EXMO

Keywords: linked data, RDF, identification, ontology, OWL, entity resolution, key, formal concept analysis, relational concept analysis, semantic web
Project duration: 5 months, can be continued as a PhD
Maximum number of students: 1
Available places: 0
Query performed on: 25 April 2024, at 22:04

Description

Identifying equivalent resources across different data sets allows reasoning with the information in both data sets. It is possible to extract, from the two data sets alone, link keys which, in turn, can directly find links between such resources. We aim at improving such link key extraction techniques so that they take advantage of the ontologies describing the data sets.

The goal of the semantic web is to take advantage of formalised knowledge at the scale of the worldwide web. This has led to the release of a vast quantity of data expressed in semantic web formalisms (RDF) [Heath 2011a]. Part of the added value of linked data lies in the links identifying the same entity in different data sets, since these links allow inferences to be drawn across data sets. For instance, they may identify the same books and articles in different bibliographical data sources. Finding the manifestations of the same entity across several data sets is therefore an important task for linked data.

One way of identifying entities is to use link keys, which generalise the keys usually found in databases to several data sets. A link key [Atencia 2014b] is a statement such as:

{<auteur, author>} {<titre, title>} linkkey <Livre, Novel>

stating that an instance of the class Livre is equivalent to an instance of the class Novel as soon as their properties auteur and author share at least one value and their properties titre and title have the same values. From such link keys, links between instances can be generated automatically.
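The way such a link key generates links can be sketched as follows. This is only a minimal illustration, not the extraction software mentioned in this proposal: the toy data, the dictionary layout and the function name generate_links are hypothetical. We also assume that "have the same values" requires the two value sets to be equal and non-empty.

```python
from itertools import product

# Hypothetical toy data: instance -> property -> set of values.
livres = {
    "l1": {"auteur": {"Victor Hugo"}, "titre": {"Les Misérables"}},
    "l2": {"auteur": {"Jules Verne"}, "titre": {"Vingt mille lieues sous les mers"}},
}
novels = {
    "n1": {"author": {"Victor Hugo", "V. Hugo"}, "title": {"Les Misérables"}},
    "n2": {"author": {"Victor Hugo"}, "title": {"Notre-Dame de Paris"}},
}


def generate_links(left, right, in_pairs, eq_pairs):
    """Return the pairs of instances satisfying the link key conditions.

    in_pairs: property pairs that must share at least one value.
    eq_pairs: property pairs that must have the same (non-empty) values;
              the non-emptiness requirement is an assumption of this sketch.
    """
    links = set()
    for (a, props_a), (b, props_b) in product(left.items(), right.items()):
        shares = all(
            props_a.get(p, set()) & props_b.get(q, set()) for p, q in in_pairs
        )
        equal = all(
            props_a.get(p, set()) and props_a.get(p, set()) == props_b.get(q, set())
            for p, q in eq_pairs
        )
        if shares and equal:
            links.add((a, b))
    return links


# {<auteur, author>} {<titre, title>} linkkey <Livre, Novel>
links = generate_links(
    livres, novels, in_pairs=[("auteur", "author")], eq_pairs=[("titre", "title")]
)
print(links)  # {('l1', 'n1')}
```

Here l1 and n2 share an author but not a title, so only the pair (l1, n1) is linked.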

We have taken advantage of techniques to extract concepts between two interdependent ordered sets, namely Formal Concept Analysis (FCA [Ganter 1999a]) and Relational Concept Analysis (RCA [Hacene 2013a]), in order to extract link key candidates across two data sets [Atencia 2014d, Vizzini 2017a].
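To give an intuition of how concept analysis yields candidates, the sketch below builds a formal context whose objects are pairs of instances and whose attributes are pairs of properties sharing at least one value, then closes the rows under intersection. It is a deliberately naive simplification written for this description: it only handles "share at least one value" conditions, it ignores classes, and it is not the algorithm of the cited works. All names and data are hypothetical.

```python
from itertools import product


def shared_property_pairs(props_a, props_b):
    """Property pairs <p, q> on which two instances share at least one value."""
    return frozenset(
        (p, q) for p, q in product(props_a, props_b) if props_a[p] & props_b[q]
    )


def candidate_conditions(left, right):
    """Non-empty sets of property pairs closed under intersection of the rows.

    Each row of the context is the set of property pairs shared by one pair
    of instances; intersections of rows are the intents of formal concepts.
    """
    rows = {
        shared_property_pairs(pa, pb)
        for pa, pb in product(left.values(), right.values())
    }
    rows.discard(frozenset())
    closed = set(rows)
    frontier = set(rows)
    while frontier:
        # Intersect newly found sets with everything known so far.
        new = {x & y for x in frontier for y in closed} - closed
        new.discard(frozenset())
        closed |= new
        frontier = new
    return closed


# Hypothetical toy data: instance -> property -> set of values.
left = {
    "l1": {"auteur": {"Hugo"}, "titre": {"Les Misérables"}},
    "l2": {"auteur": {"Hugo"}, "titre": {"Notre-Dame de Paris"}},
}
right = {
    "n1": {"author": {"Hugo"}, "title": {"Les Misérables"}},
    "n2": {"author": {"Hugo"}, "title": {"Notre-Dame de Paris"}},
}

candidates = candidate_conditions(left, right)
for condition in sorted(candidates, key=len):
    print(sorted(condition))
```

On this data two candidates are found: the condition on <auteur, author> alone, which would link every book to every novel, and the condition on both <auteur, author> and <titre, title>, which links only the matching ones.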

Although such processes have been found very accurate, they do not take into account the ontologies. The goal of this master topic is to take advantage of the ontology content and the ontology language in order to improve the extracted link keys.

Indeed, it may be that a link key condition is more general than another and that the system does not take advantage of it. This is the case if the link key is:

{<auteur, author>, <auteur, creator>} {<titre, title>} linkkey <Livre, Novel>

If creator is a more general property than author, then the link key can be reduced to the one given earlier, since if two instances have an auteur/author value in common, they necessarily have an auteur/creator value in common. The subsumption between classes, e.g., Novel is subsumed by Book, may also be used for extracting specific link keys.
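This reduction can be illustrated by a small sketch. It assumes a hypothetical, acyclic map from each property of the second data set to its direct super-properties, and it only applies to conditions of the "share at least one value" kind, for which the implication above holds. The function names are ours.

```python
# Hypothetical ontology fragment: property -> its direct super-properties.
SUPER_PROPERTIES = {"author": {"creator"}}


def ancestors(prop, supers):
    """All strict super-properties of prop (the hierarchy is assumed acyclic)."""
    seen, stack = set(), [prop]
    while stack:
        for parent in supers.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def reduce_condition(pairs, supers_right):
    """Drop <p, q> when some <p, q2> is present with q2 a sub-property of q.

    Sharing a value on <p, q2> then implies sharing one on <p, q>,
    so the more general pair is redundant.
    """
    pairs = set(pairs)
    return {
        (p, q)
        for (p, q) in pairs
        if not any(
            p2 == p and q2 != q and q in ancestors(q2, supers_right)
            for (p2, q2) in pairs
        )
    }


condition = {("auteur", "author"), ("auteur", "creator")}
print(reduce_condition(condition, SUPER_PROPERTIES))  # {('auteur', 'author')}
```

Only subsumption on the second data set is handled here; the symmetric case on the first data set would follow the same pattern.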

In addition, the limited expressivity of the link keys considered may prevent relevant ones from being found. In particular, if one data set uses a relation auteur and another uses hasWritten, then these properties are not comparable, so our extraction techniques are ineffective. However, hasWritten and writtenBy are inverse properties. It would be useful to take advantage of such pairs in order to find better link keys, e.g., finding:

{<auteur, hasWritten^-1>} {<titre, title>} linkkey <Livre, Novel>
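Evaluating hasWritten^-1 amounts to reading the hasWritten triples backwards. A minimal sketch, assuming the second data set is available as a list of (subject, property, object) triples; the data and the function name inverse_values are hypothetical:

```python
from collections import defaultdict

# Hypothetical triples of the second data set: (subject, property, object).
triples = [
    ("hugo", "hasWritten", "n1"),
    ("hugo", "hasWritten", "n2"),
    ("verne", "hasWritten", "n3"),
]


def inverse_values(triples, prop):
    """Map each object of prop to the set of its subjects, i.e. prop^-1."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == prop:
            index[obj].add(subject)
    return dict(index)


written_by = inverse_values(triples, "hasWritten")
print(written_by["n1"])  # {'hugo'}
```

The resulting value sets can then be compared with the values of auteur exactly like those of an ordinary property.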

This can also be applied to the generation of paths in link keys. For instance, one data set may have a reference to a Person object while another one may register the authorname. Hence, the relevant link key should rather be:

{<auteur . nom, authorname>} {<titre, title>} linkkey <Livre, Novel>
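A path such as auteur . nom can be evaluated by following the properties one after the other from the starting instance. The sketch below assumes a hypothetical graph stored as nested dictionaries (node -> property -> set of values); path_values is an illustrative name.

```python
def path_values(graph, start, path):
    """Values reached from start by following the properties of path in order."""
    current = {start}
    for prop in path:
        current = {
            value
            for node in current
            for value in graph.get(node, {}).get(prop, set())
        }
    return current


# Hypothetical first data set: the book refers to a Person object.
graph = {
    "l1": {"auteur": {"p1"}, "titre": {"Les Misérables"}},
    "p1": {"nom": {"Victor Hugo"}},
}

print(path_values(graph, "l1", ["auteur", "nom"]))  # {'Victor Hugo'}
```

The set obtained for auteur . nom can then be compared with the authorname values of the other data set.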

Of course, all these operations may increase the complexity of the link key extraction process. Hence, their introduction will have to be progressive and carefully studied.

Expected results

Defining strategies for taking advantage of ontologies in the link key extraction process;
Defining strategies for progressively extending the extracted link key expressivity;
Implementation and tests in existing extraction software.

This work is related to the Elker ANR project for which a PhD position is available.

References:

[Atencia 2014b] Manuel Atencia, Jérôme David, Jérôme Euzenat, Data interlinking through robust link key extraction, Proc. 21st ECAI, Prague (CZ), pp. 15-20, 2014
[Atencia 2014d] Manuel Atencia, Jérôme David, Jérôme Euzenat, What can FCA do for database link key extraction?, Proc. ECAI workshop on "what can FCA do for AI?", Prague (CZ), 2014
[Ganter 1999a] Bernhard Ganter, Rudolf Wille, Formal concept analysis: mathematical foundations, Springer, Berlin (DE), 1999
[Heath 2011a] Tom Heath and Christian Bizer, Linked Data: Evolving the Web into a Global Data Space, Morgan & Claypool, 2011
[Hacene 2013a] Mohamed Rouane Hacene, Marianne Huchard, Amedeo Napoli, Petko Valtchev, Relational concept analysis: mining concept lattices from multi-relational data, Annals of Mathematics and Artificial Intelligence
[Vizzini 2017a] Jérémy Vizzini, Data interlinking with relational concept analysis, Mémoire de Master Informatique, option Data science, Université Grenoble Alpes, 2017

Links:

mOeX web site: http://moex.inria.fr
more information: http://moex.inria.fr/training/2017-M2R-ontolinkkey.html