Lila: A Unified Benchmark for Mathematical Reasoning

Mishra, Swaroop; Finlayson, Matthew; Lu, Pan; Tang, Leonard; Welleck, Sean; Baral, Chitta; Rajpurohit, Tanmay; Tafjord, Oyvind; Sabharwal, Ashish; Clark, Peter; Kalyan, Ashwin

arXiv:2210.17517 (cs)

[Submitted on 31 Oct 2022 (v1), last revised 8 Mar 2023 (this version, v2)]

Abstract: Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question-answering, fill-in-the-blanks; (iii) language diversity, e.g., no language, simple language; (iv) external knowledge, e.g., commonsense, physics. We construct our benchmark by extending 20 datasets, collecting task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to the correct answer. We additionally introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA. Importantly, we find that multi-tasking leads to significant improvements (average relative improvement of 21.83% F1 score vs. single-task models), while the best-performing model obtains only 60.40%, indicating room for improvement in general mathematical reasoning and understanding.
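To make the idea of a solution "in the form of a Python program" concrete, the sketch below shows what such an annotation might look like. The word problem, the function name `solution`, and the variable names are all hypothetical and are not taken from the benchmark; the point is only that each intermediate step is an explicit computation that can be inspected, which is what makes the answer explainable.

```python
# Hypothetical word problem (illustrative only, not from the benchmark):
# "A shopper buys 3 notebooks at $2.50 each and pays with a $10 bill.
#  How much change does the shopper receive?"

def solution() -> float:
    """Program-form solution: every reasoning step is a named computation."""
    notebooks = 3
    price_each = 2.50
    paid = 10.00
    total_cost = notebooks * price_each  # cost of all notebooks
    change = paid - total_cost           # money returned to the shopper
    return change

# Executing the program yields the final answer; reading it yields the reasoning.
answer = solution()
print(answer)
```

Under this reading, a model can be scored on the value the program returns, while the program text itself documents how that value was obtained.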

Comments:	EMNLP 2022
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
MSC classes:	68T50
ACM classes:	I.2.7
Cite as:	arXiv:2210.17517 [cs.CL]
	(or arXiv:2210.17517v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2210.17517

Submission history

From: Matthew Finlayson
[v1] Mon, 31 Oct 2022 17:41:26 UTC (15,860 KB)
[v2] Wed, 8 Mar 2023 16:47:46 UTC (15,878 KB)
