Title: Hypotheses-based Stacking of Deep Learning and non-Deep Learning Information Retrieval

Topic proposed in: M2 MOSIG, Project --- M2 MSIAM, Project

Supervisor(s): Petra Galuscakova and Philippe Mulhem

Keywords: Information Retrieval, Deep Learning IR models
Project duration: 5 months
Maximum number of students: 1
Available places: 1


Description

Currently, many state-of-the-art IR approaches stack multiple retrieval processes. For instance, the best-performing models at the TREC Deep Learning Track, run on the MS MARCO corpus [1], rely on reranking: a first retrieval stage is run over a large corpus composed of millions of documents, and a second retrieval stage is then applied to the top results of the first. Implicitly, such a stacked retrieval process is based on the following hypotheses (a minimal sketch of such a pipeline is given after the list):

H1-Efficiency: Some retrieval models are much faster than others, and are thus able to process large sets of documents in a reasonable time;

H2-Effectiveness: Some retrieval models perform better than others.
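
For illustration, the following minimal Python sketch shows such a two-stage (retrieve-then-rerank) pipeline under these two hypotheses. The scorers cheap_score and expensive_score are hypothetical placeholders (standing in for, e.g., BM25 and a neural cross-encoder) and do not come from any specific library.

    # Minimal sketch of a two-stage (retrieve-then-rerank) pipeline.
    # cheap_score and expensive_score are hypothetical placeholders.

    def cheap_score(query: str, doc: str) -> float:
        # H1: fast lexical overlap, cheap enough to score millions of documents.
        q_terms = set(query.lower().split())
        d_terms = set(doc.lower().split())
        return len(q_terms & d_terms) / (len(q_terms) or 1)

    def expensive_score(query: str, doc: str) -> float:
        # H2: stand-in for a slower but more effective model; here the cheap
        # score plus a toy document-length prior, purely for illustration.
        return cheap_score(query, doc) + 1.0 / (1 + abs(len(doc.split()) - 30))

    def rerank_pipeline(query, corpus, k1=1000, k2=10):
        # Stage 1: rank the whole corpus with the efficient model, keep top k1.
        stage1 = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:k1]
        # Stage 2: rescore only these k1 candidates with the effective model.
        return sorted(stage1, key=lambda d: expensive_score(query, d), reverse=True)[:k2]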

With respect to these hypotheses, stacking employs efficient and reasonably effective models in its first retrieval stage, and the most effective systems in the second stage, to achieve good results on large sets of documents. Of course, a simple sequential stacking is not the only possible combination. Different retrieval approaches may also be applied in parallel and their results merged, as we did in previous experiments [2]. Moreover, these models may be combined with recent approaches such as ColBERT [3] and uniCOIL [4], which tend to be efficient while still being highly effective.
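One common way to merge the ranked lists produced by such parallel runs is reciprocal rank fusion (RRF). The sketch below is a generic illustration of RRF, not necessarily the merging strategy used in [2]; k=60 is the constant customarily reported in the RRF literature.

    from collections import defaultdict

    def reciprocal_rank_fusion(runs, k=60):
        # runs: list of ranked lists of document ids, best document first.
        scores = defaultdict(float)
        for run in runs:
            for rank, doc_id in enumerate(run, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # Example: fuse a lexical run and a dense run for the same query.
    fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d7", "d2"]])

RRF is attractive in this setting because it only needs ranks, not comparable scores, so heterogeneous models can be merged without score normalisation.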

The goal of this internship is to define a framework able to generate candidate stackings based on the features of the models considered, and then to experiment with the proposed stackings in order to verify the improvements achieved.

Formally, the goals of this work are thus the following:

i) Formulate a set of hypotheses that may define a stacked retrieval process

ii) Formalize the stacking processes according to the hypotheses formulated in i) (see the sketch after this list)

iii) Perform experiments with different stacked retrieval setups on several common IR test collections, such as TREC-DL 2021 [5].
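
As a hypothetical starting point for goals i) and ii), the sketch below encodes each candidate model with the two features behind H1 and H2 (the names and numbers are made up for illustration) and derives a sequential stacking by ordering the stages by decreasing efficiency, so that fast models narrow the candidate set for slower, more effective ones.

    from dataclasses import dataclass

    @dataclass
    class RetrievalModel:
        name: str
        efficiency: float     # H1: e.g. documents scored per second
        effectiveness: float  # H2: e.g. nDCG@10 on a validation set

    # Hypothetical feature values, for illustration only.
    models = [
        RetrievalModel("bm25",          efficiency=1e6, effectiveness=0.45),
        RetrievalModel("unicoil",       efficiency=1e4, effectiveness=0.60),
        RetrievalModel("cross-encoder", efficiency=1e2, effectiveness=0.70),
    ]

    def propose_sequential_stacking(models):
        # Order stages by decreasing efficiency: each stage narrows the
        # candidate set for the slower, more effective model that follows.
        return sorted(models, key=lambda m: m.efficiency, reverse=True)

    stacking = propose_sequential_stacking(models)
    print(" -> ".join(m.name for m in stacking))  # bm25 -> unicoil -> cross-encoder

A real framework would of course replace this single ordering rule with the hypotheses formulated in i), and could also emit parallel-plus-merge combinations rather than purely sequential ones.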

Supervision by Petra Galuscakova and Philippe Mulhem

[1] Yixuan Qiao, Hao Chen, Yongquan Lai, Jun Wang, Tuozhen Liu, Xianbin Ye, Rui Fang, Peng Gao, Wenfeng Xie, and Guotong Xie: PASH at TREC 2021 Deep Learning Track: Generative Enhanced Model for Multi-stage Ranking, arXiv:2205.11245v2, 2021.

[2] Petra Galuscakova, Lucas Alberede, Gabriela Nicole Gonzalez Saez, Aidan Mannion, Philippe Mulhem, and Georges Quénot: Université Grenoble Alpes at TREC Deep Learning Track 2022 (to be published).

[3] Omar Khattab and Matei Zaharia: ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20), Association for Computing Machinery, New York, NY, USA, 39-48, 2020.

[4] Jimmy Lin and Xueguang Ma: A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques, arXiv:2106.14807v1, 2021.

[5] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos and Jimmy Lin: Overview of the TREC 2021 deep learning track, Text REtrieval Conference (TREC), 2021.