Eval Suites

Community benchmark suites for evaluating local LLM quality. Submit results via the API.

Competition math problems spanning algebra, counting, geometry, intermediate algebra, number theory, prealgebra, and precalculus.

math · 0 runs

GSM8K · Official

v1.0 · LM-Eval run

Grade School Math 8K: roughly 8,500 grade-school math word problems that require multi-step arithmetic reasoning. A standard benchmark for math reasoning capability.

math · 3 runs
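
As a rough illustration of how a GSM8K run might be scored, the sketch below computes exact-match accuracy on the final numeric answer. It relies on the public GSM8K convention that reference solutions end with a `#### <number>` line. It also assumes that model outputs were prompted to end the same way. The helper names `final_answer` and `exact_match_accuracy` are hypothetical, and this is not necessarily how this site or LM-Eval computes its scores.

```python
import re

# Assumption: answers end with a marker such as "#### 42" (the GSM8K
# reference format); model outputs are assumed to follow it as well.
_FINAL = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def final_answer(text):
    """Return the number after the last '####' marker, or None if absent."""
    matches = _FINAL.findall(text)
    if not matches:
        return None
    # Drop thousands separators so "1,234" and "1234" compare equal.
    return matches[-1].replace(",", "")


def exact_match_accuracy(predictions, references):
    """Fraction of predictions whose final number equals the reference's."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = final_answer(pred), final_answer(ref)
        # A missing marker on either side counts as incorrect.
        if p is not None and r is not None and float(p) == float(r):
            correct += 1
    return correct / len(references)
```

Comparing the extracted values as floats treats `18` and `18.0` as the same answer; a stricter scorer could compare the strings directly instead.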