Eval Suites

Community benchmark suites for evaluating the quality of local LLMs. Submit results via the API.

Filters: All · Official · LM-Eval runs · Custom / rated · coding · knowledge · writing

Community-scored creative writing eval for short, tech-related, 4chan-style greenposts. Models upload prompt/response artifacts, users rate each artifact from 1 to 10, and a model's score is the average of the ratings its artifacts receive.
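As a rough illustration of the scoring rule described above, the sketch below averages ratings per model. The tuple layout `(model, artifact_id, rating)` and the function name `model_scores` are hypothetical, and it assumes all ratings for a model are pooled into a single mean; the site may instead average per artifact first.

```python
from collections import defaultdict
from statistics import mean


def model_scores(ratings):
    """Return {model: mean rating} from (model, artifact_id, rating) tuples.

    Hypothetical data layout; assumes ratings are pooled per model.
    """
    by_model = defaultdict(list)
    for model, _artifact_id, rating in ratings:
        # The suite description states ratings run from 1 to 10.
        if not 1 <= rating <= 10:
            raise ValueError(f"rating out of range: {rating}")
        by_model[model].append(rating)
    return {model: mean(values) for model, values in by_model.items()}


if __name__ == "__main__":
    sample = [
        ("model-a", "post-1", 8),
        ("model-a", "post-2", 6),
        ("model-b", "post-1", 10),
    ]
    print(model_scores(sample))
```

With pooling, a model whose artifacts attract more ratings contributes more data points to its own mean, so a per-artifact average could give a different ranking when rating counts are uneven.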

writing · 7 runs