TurtleBench is a dynamic evaluation benchmark designed to assess the reasoning capabilities of large language models (LLMs) through real-world yes/no puzzles, emphasizing logical reasoning over ...