A fundamental concern in the field of Information Retrieval (IR) is evaluation. The most direct approach to evaluating an IR system is to rely on user experiments (the so-called "interactive evaluation" approach), but this is unrealistic because of the number of users that would have to be involved.
Since the 1970s, the main method of evaluating IR systems has relied on the "Cranfield paradigm" [Harman 2010], used for test collections such as TREC. These test collections include a set of documents (corpus), a set of queries, and relevance judgments (assessments) of the documents with respect to these queries. Queries are chosen and written by experts, and relevance is also judged by experts. Evaluating an IR system consists of submitting the set of queries to the system and then scoring the results it returns according to certain criteria, taking the experts' judgments as the ground truth. Conceptually, this approach replaces users with a subset of them (the experts) working on a subset of queries.
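The Cranfield-style loop described above can be sketched in a few lines. This is a minimal illustration with hypothetical data, not an actual TREC tool: expert judgments (qrels) serve as ground truth, and a system's ranked list is scored against them, here with Precision@k.

```python
def precision_at_k(ranked_docs, qrels, k):
    """Fraction of the top-k results judged relevant by the experts."""
    top_k = ranked_docs[:k]
    relevant = sum(1 for doc in top_k if qrels.get(doc, 0) > 0)
    return relevant / k

# Hypothetical expert judgments for one query: 1 = relevant, 0 = not relevant.
qrels = {"d1": 1, "d2": 0, "d3": 1, "d4": 1}

# Ranked list returned by the system under evaluation.
ranking = ["d3", "d2", "d1", "d5"]

print(precision_at_k(ranking, qrels, 4))  # 0.5: d3 and d1 relevant among top 4
```

Documents absent from the qrels (here d5) are counted as non-relevant, which is the usual convention for pooled test collections.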
While this approach is relevant and allows for accurate evaluation and comparison of IR systems, it cannot be a complete substitute for true interactive assessments. Indeed, some specific aspects of human behavior are not taken into account [Harman 2010], especially in the context of web search, because: (1) the actual relevance estimation of documents by users is a two-step sequential process in which a user first reviews the results page and then consults the corresponding documents; it is therefore difficult to link the quality measures of an IR system computed from expert judgments to its real contribution to the user; (2) only a few snippets, usually the top 3 to 5, are actually taken into account by a user when evaluating the answers to their query on a results page [Baeza-Yates 2018]. Some evaluation measures for IR systems (e.g., nDCG) explicitly use ranks in their expression, but other elements should be integrated (e.g., highlighted terms, freshness of the document).
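As a concrete illustration of a rank-based measure, here is a minimal sketch of nDCG: graded gains are discounted logarithmically by rank, so the top positions dominate the score, which partly mirrors the fact that users mostly look at the first few snippets. The judgment values below are hypothetical.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: the gain at rank i is divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    """DCG normalized by the DCG of the ideal (best possible) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgments (0-2), in the order the system returned them.
print(ndcg([2, 0, 1, 0, 2]))  # below 1.0: relevant documents are ranked too low
print(ndcg([2, 2, 1, 0, 0]))  # 1.0: the ranking matches the ideal ordering
```

Note that nDCG only encodes the rank discount; elements such as highlighted terms or document freshness, mentioned above, are outside its expression.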
Other approaches take the opposite stance and replace expert judgments with the (simplified) judgments of a very large number of users [Chuklin et al. 2013], in order to better integrate the user into the evaluation loop. In practice, this consists of analyzing the selections made by users on result pages in order to infer relevance judgments, based on the assumption that selecting a document implies its relevance. These approaches require significant resources and a very large number of users, either via search engine logs (e.g., Google, Qwant) or via micro-tasks (e.g., Amazon Mechanical Turk, Crowdflower), which are beyond the reach of most research teams. In addition, the judgments are of low reliability because what is analyzed (mainly "clicks") is simplistic: documents not selected by the user are all treated in the same way, as irrelevant, whether the user actually read (that is, evaluated) them or not. Furthermore, this approach lacks explanatory power, because the reasons why the user judged a document relevant (or not), i.e. the terms read, are not known.
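The click-based inference just described, and its main weakness, can be made concrete with a small sketch (the log format is hypothetical): a click counts as a relevance vote, and every non-clicked result is treated as non-relevant, so a document that was read and rejected and a document that was never examined end up in the same class.

```python
from collections import defaultdict

def infer_relevance(click_log):
    """Estimate relevance from a log of (query, shown_docs, clicked_docs) tuples.

    Returns a crude click-through rate per (query, doc) pair. Non-clicked
    documents get a low score whether the user read them or never saw them.
    """
    votes = defaultdict(lambda: [0, 0])  # (query, doc) -> [clicks, impressions]
    for query, shown, clicked in click_log:
        for doc in shown:
            votes[(query, doc)][1] += 1
            if doc in clicked:
                votes[(query, doc)][0] += 1
    return {key: clicks / shows for key, (clicks, shows) in votes.items()}

log = [("q1", ["d1", "d2", "d3"], {"d1"}),
       ("q1", ["d1", "d2", "d3"], {"d1", "d3"})]
print(infer_relevance(log))  # d1 -> 1.0, d2 -> 0.0, d3 -> 0.5
```

The estimate says nothing about *why* d2 was skipped, which is exactly the lack of explanatory power noted above.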
In summary, the limitations of current approaches are their lack of consideration of the central role of snippets in assessing the relevance of results, as well as of the behavioral strategies users implement to evaluate those results.
Our project aims to synthesize the two approaches described above (judgments via experts or via users) and, building on current test collections, to propose an approach whose reliability will be close to that of interactive evaluation via user tests.
Even if the approach we propose is generic and applies in principle to any IR system, the angle of study we choose to address in this project focuses on interaction: evaluating relevance feedback assisted by eye-movement analysis for information retrieval [Albarede et al. 2019]. The evaluation of such systems needs to take into account the behavioral strategy of users when they evaluate result pages and the snippets they contain. We will build on top of the demonstrator already developed during a previous project [Sungeelee et al. 2020].
Our goal is to generate a population of synthetic users able to simulate the behavior of real users searching for information on the web, taking as input the queries from a test collection. The idea is to replace the classic interactive evaluations carried out with users by experiments based on simulating the behavior of these users. This approach addresses two bottlenecks of conventional interactive assessments. First, the number of (synthetic) users involved in the experiments is no longer limited. Second, the behavior of each synthetic user is exactly reproducible. However, this approach raises other issues, the most important of which is ensuring the validity of the approach, in other words that the synthetic user population is, in the context of IR, consistent with the behavior of real users.
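To fix ideas, a synthetic user could be sketched as follows. All parameters here are hypothetical, not the project's model: the simulated user scans snippets from the top, clicks a snippet with a probability given by its perceived relevance, and abandons the page with a fixed probability after each snippet. Seeding the generator makes each synthetic user's behavior exactly reproducible, which is the second advantage noted above.

```python
import random

def simulate_user(perceived_relevance, p_abandon=0.3, seed=0):
    """Simulate one synthetic user scanning a ranked list of snippets.

    perceived_relevance: per-rank probability that the snippet looks relevant.
    Returns the list of ranks the simulated user clicked.
    """
    rng = random.Random(seed)           # seeding => reproducible behavior
    clicks = []
    for rank, p_rel in enumerate(perceived_relevance):
        if rng.random() < p_rel:        # snippet judged relevant -> click
            clicks.append(rank)
        if rng.random() < p_abandon:    # user stops scanning the page
            break
    return clicks

# Two runs with the same seed produce exactly the same behavior.
print(simulate_user([0.9, 0.2, 0.6], seed=42))
```

A population of synthetic users is then just a set of seeds (and, in a richer model, of behavioral parameters fitted on real users).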
To ensure this validity, we propose an approach where the human user is still present, but plays a different role in the experimental protocol. Our proposal consists of working in two stages: first, human behavior is modeled in the context of the targeted system via experiments with a limited number of real users; second, this model is used to test the IR system via simulations of human behavior obtained by generating variations of the model.
For this, it is necessary to model how the user evaluates the results page globally, then each snippet individually, and, as a corollary, to define how the user assesses the relevance of a snippet, in particular through the terms it contains. This is the second bottleneck of the approach. Indeed, the relevance of a snippet can be seen from two points of view: the relevance of the document resulting from expert evaluations (called intrinsic relevance) and the relevance perceived by the user when reading the terms contained in the snippet (called perceived relevance) [He et al. 2012]. While the first comes with the test collections, the second depends on how snippets are generated [Chuklin and de Rijke 2014]. One solution would be to ask experts to assess perceived relevance and identify the terms that led to this assessment, but this is not realistic given the number of documents in a test collection, so perceived relevance must be determined automatically.
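As a purely illustrative baseline (not the project's actual method) for determining perceived relevance automatically, one could score a snippet by the overlap between its terms and the query terms, the kind of surface signal a user reading the snippet can rely on, and return the matched terms as a rudimentary explanation.

```python
def perceived_relevance(snippet, query):
    """Score a snippet by query-term overlap; also return the matched terms."""
    snippet_terms = set(snippet.lower().split())
    query_terms = set(query.lower().split())
    if not query_terms:
        return 0.0, []
    matched = snippet_terms & query_terms
    return len(matched) / len(query_terms), sorted(matched)

score, terms = perceived_relevance(
    "Eye tracking for relevance feedback in information retrieval",
    "relevance feedback eye tracking")
print(score, terms)  # 1.0: all query terms appear in the snippet
```

A realistic model would need much more (term weighting, highlighting, position on the page, reading order), which is precisely what the behavioral modeling stage is meant to capture.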
In summary, the main expected contributions of this project are: (1) a behavioral model of the user of an IR system with several levels of abstraction (the results page and the snippet) and (2) a method to determine the perceived relevance of a snippet generated from a test collection.
L. Albarede, F. Jambon and P. Mulhem. 2019. Exploration de l’apport de l’analyse des perceptions oculaires : étude préliminaire pour le bouclage de pertinence. CORIA 2019, Lyon, France. http://www.asso-aria.org/coria/2019/CORIA_2019_paper_1.pdf
R. Baeza-Yates. 2018. Bias on the web. Communications of the ACM 61, 6 (May 2018), 54-61.
A. Chuklin, P. Serdyukov, and M. de Rijke. 2013. Click model-based information retrieval metrics. In Proc. of ACM SIGIR '13, 493-502.
A. Chuklin and M. de Rijke. 2014. The Anatomy of Relevance: Topical, Snippet and Perceived Relevance in Search Result Evaluation. ACM SIGIR'14 Workshop on Gathering Efficient Assessments of Relevance.
D. Harman. 2010. Is the Cranfield paradigm outdated? ACM SIGIR '10. Keynote.
J. He, P. Duboue, and J.-Y. Nie. 2012. Bridging the gap between intrinsic and perceived relevance in snippet generation. In Proc. of COLING 2012, 1129-1146.
V. Sungeelee, F. Jambon, and P. Mulhem. 2020. Proof of Concept and Evaluation of Eye Gaze Enhanced Relevance Feedback in Ecological Context. In Proc. of the Joint Conference of the Information Retrieval Communities in Europe (CIRCLE 2020), Samatan, Gers, France, July 6-9, 2020. https://www.irit.fr/CIRCLE/wp-content/uploads/2020/06/CIRCLE20_04.pdf