Author here - I'm planning to create game versions of this benchmark, as well as of my other multi-agent benchmarks (https://github.com/lechmazur/step_game, https://github.com/lechmazur/pgg_bench/, and a few others I'm developing). But I'm not sure a leaderboard alone would be enough for comparing LLMs to top humans, since it would require humans to play so many games that it would become tedious. So I think it would be just for fun.
I was inspired by your project to start making similar multi-agent reality simulations. I’m starting with the reality game “The Traitors” because it has interesting dynamics.
If you watch top-tier social deduction players on YouTube (games like Blood on the Clocktower, etc.), they'd figure out the LLM's weaknesses and exploit them immediately.
I'm interested in seeing how the LLMs react to specific predefined strategies. E.g. an "honest" bot that says "I'm voting for player [random number]." and then does exactly that every round (not sure how to handle the jury step). Do the LLMs decide to keep it around longer, or eliminate it for being impossible to reason with once it picks them?
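To make the idea concrete, here's a minimal sketch of that honest bot. The class and method names (`statement`, `vote`) are my own assumptions for illustration, not any benchmark's actual API:

```python
import random

class HonestBot:
    """Scripted 'honest' player: announces a random vote target each
    round, then votes exactly as announced. Interface is hypothetical."""

    def __init__(self, player_ids, rng=None):
        self.player_ids = list(player_ids)
        self.rng = rng or random.Random()
        self._target = None

    def statement(self):
        # Pick this round's target uniformly at random and announce it.
        self._target = self.rng.choice(self.player_ids)
        return f"I'm voting for player {self._target}."

    def vote(self):
        # Never bluffs: the vote always matches the announcement.
        return self._target
```

Since its statements are perfectly predictive of its votes, any model that tracks consistency should flag it quickly; the open question is what they do with that information.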
Yes, predefined strategies are very interesting to examine. I have two simple ones in another multi-agent benchmark, https://github.com/lechmazur/step_game (SilentGreedyPlayer and SilentRandomPlayer), and it's fascinating to see LLMs detect and respond to them. The only issue with including them here is that the cost of running a large set of games isn't trivial.
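For reference, a baseline like SilentRandomPlayer can be sketched in a few lines. This is my own rough reconstruction, not the repo's actual code, and the move set `(1, 3, 5)` is an assumption about step_game's rules:

```python
import random

class SilentRandomPlayer:
    """Sketch of a 'silent random' baseline: never talks, plays a
    uniformly random legal move each round. Details are assumptions."""

    def __init__(self, moves=(1, 3, 5), rng=None):
        self.moves = tuple(moves)
        self.rng = rng or random.Random()

    def statement(self):
        return ""  # contributes nothing to table talk

    def move(self):
        return self.rng.choice(self.moves)
```

Baselines like this are cheap per game; the cost issue is purely the number of games needed for stable comparisons.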
Another multi-agent benchmark I'm currently developing, which involves buying and selling, will also feature many predefined strategies.