
Evaluation Suites

Community benchmark suites for evaluating the quality of local LLMs. Submit results through the API.

Open LLM Leaderboard · Official
v1.0 · LM-Eval run

The canonical HuggingFace Open LLM Leaderboard suite: MMLU, ARC Challenge, HellaSwag, WinoGrande, TruthfulQA MC2, and GSM8K with official few-shot settings. Weighted mean aggregate.

reasoning · 0 records
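
The "weighted mean aggregate" mentioned for this suite can be sketched as follows. The page does not state the weights, so the equal weights here are an assumption, and the task keys, the `weighted_mean` helper, and the scores are hypothetical placeholders rather than the site's actual implementation or real results.

```python
from typing import Mapping


def weighted_mean(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Return the weighted mean of per-task scores.

    Weights do not need to sum to 1; they are normalized by their total.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for tasks: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[task] * w for task, w in weights.items()) / total_weight


# Assumption: equal weights for the six tasks named in the suite description.
# The real weights used by this site are not given on the page.
SUITE_WEIGHTS = {
    "mmlu": 1.0,
    "arc_challenge": 1.0,
    "hellaswag": 1.0,
    "winogrande": 1.0,
    "truthfulqa_mc2": 1.0,
    "gsm8k": 1.0,
}

# Placeholder scores for illustration only, not measurements of any model.
example_scores = {
    "mmlu": 0.60,
    "arc_challenge": 0.55,
    "hellaswag": 0.80,
    "winogrande": 0.75,
    "truthfulqa_mc2": 0.45,
    "gsm8k": 0.35,
}

if __name__ == "__main__":
    print(f"aggregate: {weighted_mean(example_scores, SUITE_WEIGHTS):.4f}")
```

With equal weights this reduces to a plain average; changing a value in `SUITE_WEIGHTS` shifts how much that task contributes to the aggregate.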
DROP · Official
v1.0 · LM-Eval run

Discrete Reasoning Over Paragraphs. Reading-comprehension benchmark requiring numerical and symbolic reasoning over passages.

reasoning · 0 records
Big-Bench Hard · Official
v1.0 · LM-Eval run

A collection of challenging BIG-Bench tasks selected because prior models performed poorly. Covers symbolic reasoning, algorithmic reasoning, and language understanding.

reasoning · 0 records
GPQA Diamond · Official
v1.0 · LM-Eval run

Graduate-level Google-proof Q&A benchmark focused on biology, physics, and chemistry. The Diamond split is the highest-quality expert-validated subset.

reasoning · 0 records
WinoGrande · Official
v1.0 · LM-Eval run

Large-scale Winograd schema challenge for commonsense reasoning. Fill-in-the-blank pronoun resolution requiring world knowledge.

reasoning · 0 records
HellaSwag · Official
v1.0 · LM-Eval run

Sentence completion benchmark testing grounded commonsense inference. Models must pick the most plausible continuation of an activity description.

reasoning · 1 record
ARC Challenge · Official
v1.0 · LM-Eval run

AI2 Reasoning Challenge (Challenge set) — grade-school science questions that require reasoning beyond simple retrieval. Harder subset of ARC.

reasoning · 0 records
MMLU · Official
v1.0 · LM-Eval run

Massive Multitask Language Understanding: a 57-subject academic exam covering STEM, humanities, social sciences, and more. A widely used broad-knowledge benchmark.

reasoning · 1 record
Local Reasoning Mini · Official
v1.0 · Custom server-side

A lightweight 10-question sanity check for locally served models. Designed for the trusted /api/evals/execute path.

reasoning · 3 records