Conversational Agents (CA) as frontends to Information Retrieval (IR) and Recommender Systems (RS) become more popular in everyday life, with a wider range of users and usages. The latest developments in Large Language Models (LLMs) will have tremendous consequences, especially for the workplace and education. In this Dagstuhl Perspectives Workshop, we want to focus on the evaluation of these conversational systems, as appropriate methods are still missing. The quality of these systems is limited in terms of personalization, veracity and correctness, bias, transparency, trustworthiness, and understandability. Thus, evaluation methods must address these shortcomings. Furthermore, user- and usage-oriented aspects should become a more prominent and integral component in evaluations, as the user population as well as the tasks these systems are used for become more heterogeneous. For this reason, the topic-centric view of relevance has to be extended to a broad range of facets which are important for the different usage scenarios. Therefore, suitable evaluation criteria have to be specified, which form the basis for defining appropriate measures. Most importantly, the range of evaluation methods must be revisited and extended, as popular methods like the Cranfield approach or crowdsourcing must be complemented by new evaluation methods and strategies specifically tailored to this new type of system.
More in detail, we will focus our discussion on several key open issues, among which are the following:
Overall, all the above questions call, as one possible output of the workshop, for envisioning some reference architecture for CA systems, geared towards evaluation, which allows the different areas to cooperate on a common ground and to share a common roadmap for improving our understanding of CA systems and making them more effective.