- Improves upon GPT-4o's score on the Short Story Creative Writing Benchmark, but the Claude Sonnet models and DeepSeek R1 score higher. (https://github.com/lechmazur/writing/)
- Improves upon GPT-4o's score on the Confabulations/Hallucinations on Provided Documents Benchmark, nearly matching Gemini 1.5 Pro (Sept) as the best-performing non-reasoning model. (https://github.com/lechmazur/confabulations)
- Improves upon GPT-4o's score on the Thematic Generalization Benchmark, but it doesn't match the scores of Claude 3.7 Sonnet or Gemini 2.0 Pro Exp. (https://github.com/lechmazur/generalization)
I should have the results from the multi-agent collaboration, strategy, and deception benchmarks within a couple of days. (https://github.com/lechmazur/elimination_game/, https://github.com/lechmazur/step_game, and https://github.com/lechmazur/goods)