評価スイート

ローカルLLMの品質を評価するためのコミュニティベンチマークスイート。APIから結果を送信してください。

Tests whether models generate truthful answers to questions that humans often answer incorrectly due to misconceptions or false beliefs.