Publications
2025
- Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests
  Filippo Momentè, Alessandro Suglia, Mario Giulianelli, and 6 more authors
  In Findings of the Association for Computational Linguistics: EMNLP 2025, Nov 2025
We examine three evaluation paradigms: standard benchmarks (e.g., MMLU and BBH), interactive games (e.g., Signalling Games or Taboo), and cognitive tests (e.g., for working memory or theory of mind). First, we investigate which of the former two—benchmarks or games—is most effective at discriminating LLMs of varying quality. Then, inspired by human cognitive assessments, we compile a suite of targeted tests that measure cognitive abilities deemed essential for effective language use, and we investigate their correlation with model performance in benchmarks and games. Our analyses reveal that interactive games are superior to standard benchmarks in discriminating models. Causal and logical reasoning correlate with both static and interactive tests, while differences emerge regarding core executive functions and social/emotional skills, which correlate more with games. We advocate for the development of new interactive benchmarks and targeted cognitive tasks inspired by assessing human abilities but designed specifically for LLMs.
@inproceedings{momente-etal-2025-triangulating,
  title = {Triangulating {LLM} Progress through Benchmarks, Games, and Cognitive Tests},
  author = {Moment{\`e}, Filippo and Suglia, Alessandro and Giulianelli, Mario and Ferrari, Ambra and Koller, Alexander and Lemon, Oliver and Schlangen, David and Fern{\'a}ndez, Raquel and Bernardi, Raffaella},
  editor = {Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025},
  month = nov,
  year = {2025},
  address = {Suzhou, China},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2025.findings-emnlp.1092/},
  doi = {10.18653/v1/2025.findings-emnlp.1092},
  pages = {20051--20072},
  isbn = {979-8-89176-335-7}
}
- Playpen: An Environment for Exploring Learning From Dialogue Game Feedback
  Nicola Horst, Davide Mazzaccara, Antonia Schmidt, and 13 more authors
  In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Nov 2025
Interaction between learner and feedback-giver has come into focus recently for post-training of Large Language Models (LLMs), through the use of reward models that judge the appropriateness of a model’s response. In this paper, we investigate whether Dialogue Games—goal-directed and rule-governed activities driven predominantly by verbal actions—can also serve as a source of feedback signals for learning. We introduce Playpen, an environment for off- and online learning through Dialogue Game self-play, and investigate a representative set of post-training methods: supervised fine-tuning; direct alignment (DPO); and reinforcement learning with Group Relative Policy Optimization (GRPO).
@inproceedings{horst-etal-2025-playpen,
  title = {Playpen: An Environment for Exploring Learning From Dialogue Game Feedback},
  author = {Horst, Nicola and Mazzaccara, Davide and Schmidt, Antonia and Sullivan, Michael and Moment{\`e}, Filippo and Franceschetti, Luca and Sadler, Philipp and Hakimov, Sherzod and Testoni, Alberto and Bernardi, Raffaella and Fern{\'a}ndez, Raquel and Koller, Alexander and Lemon, Oliver and Schlangen, David and Giulianelli, Mario and Suglia, Alessandro},
  editor = {Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  month = nov,
  year = {2025},
  address = {Suzhou, China},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2025.emnlp-main.1517/},
  doi = {10.18653/v1/2025.emnlp-main.1517},
  pages = {29842--29879},
  isbn = {979-8-89176-332-6}
}