用於評估本地LLM品質的社群基準測試套件。透過API提交結果。
Tests whether models generate truthful answers to questions that humans often answer incorrectly due to misconceptions or false beliefs.