Evaluation Suites

Community benchmark suites for evaluating the quality of local LLMs. Submit results via the API.


Community-scored creative-writing eval for short, tech-related, 4chan-style greenposts. Models upload prompt/response artifacts, users rate each artifact from 1 to 10, and a model's score is the average of its ratings.

writing · 7 runs
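
The scoring rule described for this eval (ratings from 1 to 10, averaged per model) can be sketched as below. This is a minimal illustration, not the site's actual implementation: the function name `model_score` is hypothetical, and it assumes all ratings for a model are pooled into one flat average rather than first averaged per artifact.

```python
from statistics import mean


def model_score(ratings: list[int]) -> float:
    """Return the mean of a model's ratings.

    Assumption: every rating across all of the model's artifacts is
    pooled into one flat average. The listing does not say whether
    ratings are averaged per artifact first.
    """
    if not ratings:
        raise ValueError("a model needs at least one rating to be scored")
    for r in ratings:
        # Ratings are described as running from 1 to 10 inclusive.
        if not 1 <= r <= 10:
            raise ValueError(f"rating out of range 1-10: {r}")
    return float(mean(ratings))


if __name__ == "__main__":
    print(model_score([7, 9, 8]))
```

A pooled mean gives more weight to artifacts that attract more votes; a per-artifact mean would weight every artifact equally.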

HumanEval 0-shot

v1.0 · LM-Eval run

OpenAI HumanEval via EleutherAI lm-evaluation-harness task humaneval, 0-shot, pass@k code-generation scoring.

coding · 1 run
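
For readers unfamiliar with pass@k, the sketch below shows the unbiased estimator commonly used with HumanEval: given `n` sampled completions for a problem, of which `c` pass the unit tests, it estimates the probability that at least one of `k` samples passes. The function name `pass_at_k` is illustrative, and this is not taken from the harness's source; the listing does not state which values of `n` and `k` the run uses.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: total completions sampled for the problem
    c: completions that passed the tests
    k: sample budget being scored
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k failing samples, every draw of k contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples, 3 of them correct, scored at k = 1
    print(pass_at_k(10, 3, 1))
```

The benchmark-level score is the mean of this value over all problems. For `k = 1` the estimator reduces to `c / n`, the fraction of passing samples.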

MMLU 5-shot

v1.0 · LM-Eval run

Massive Multitask Language Understanding via EleutherAI lm-evaluation-harness task mmlu, 5-shot, exact-match/accuracy-style scoring.
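
As a rough illustration of accuracy-style scoring on multiple-choice questions, the sketch below compares predicted answer letters with gold letters and reports the fraction that match. The function name `accuracy` and the whitespace/case normalization are assumptions made for this example; the harness's own answer matching and its aggregation across MMLU subjects are not described in this listing.

```python
def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match the reference answer letter.

    Assumption: answers are single letters such as "A".."D", compared
    after trimming whitespace and upper-casing.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be the same length")
    if not references:
        raise ValueError("cannot score an empty set of questions")
    correct = sum(
        p.strip().upper() == r.strip().upper()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


if __name__ == "__main__":
    # Three of four answers match the gold labels.
    print(accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"]))
```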