Eval Suites
Community benchmark suites for evaluating local LLM quality. Submit results via the API.
Official
v1.0 · Custom server-side
writing · 1 run
Official
v1.0 · LM-Eval run
reasoning · 0 runs
Official
v1.0 · LM-Eval run
math · 0 runs
Official
v1.0 · LM-Eval run
reasoning · 0 runs
Official
v1.0 · LM-Eval run
reasoning · 0 runs
Official
v1.0 · LM-Eval run
reasoning · 0 runs
Official
v1.0 · LM-Eval run
coding · 0 runs
Official
v1.0 · LM-Eval run
coding · 0 runs
Official
v1.0 · LM-Eval run
math · 3 runs
Official
v1.0 · LM-Eval run
truthfulness · 0 runs
Official
v1.0 · LM-Eval run
reasoning · 0 runs
Official
v1.0 · LM-Eval run
reasoning · 1 run
Official
v1.0 · LM-Eval run
reasoning · 0 runs
Official
v1.0 · LM-Eval run
reasoning · 1 run
Official
v1.0 · Custom server-side
reasoning · 3 runs
v1.0 · LM-Eval run
coding · 0 runs
v1.0 · LM-Eval run
knowledge · 0 runs
v1.0 · LM-Eval run
knowledge · 1 run