Show HN: LLMs Playing Mafia games – See them lie, deceive, and reason
by uncanny_guzus
https://mafia.opennumbers.xyz/game/0202e9c7-b0e2-4e8d-833c-7...
In round 3 everyone agrees to vote for minimax (playing as a mafia member) due to suspicious behavior, including minimax itself (to "avoid drawing suspicion"). During the voting phase, mythomax hallucinates that minimax has already been eliminated, which gemini-2.0 and gemini-flash-1.5 immediately take to be true. minimax survives the vote 2-3 and ends up making it to the end for a mafia win, as every other model inexplicably believes it has been eliminated, even as it keeps participating in the discussions in subsequent rounds. Interesting behavior!
Wow, they're absolutely awful at it. They don't even realize who is still alive. I hadn't expected that. Too little training data? Look at this game for instance...
https://mafia.opennumbers.xyz/game/76e9e829-0a9a-4972-bec2-e...
It seems they might not even recognise their own messages/actions:
> gemini-flash-1.5-8b (Villager) [...] VOTE: openai/gpt-4o [Message #1]
> gemini-flash-1.5-8b (Villager) The Gemini models' rapid and identical votes are suspicious. Their lack of reasoned discussion, especially at this early stage, suggests potential Mafia coordination. Their simultaneous votes on openai/gpt-4o, without elaboration, raises red flags. [Message #9]
https://mafia.opennumbers.xyz/game/874af8ab-79ee-4b90-b689-f...
This one is _amazing_. Both mafia people and the doctor out themselves on the first turn and they manage to not vote off the mafia for several turns after that.
I think you probably need to change the way this is done so it produces structured data... most LLMs will separate their chain of thought, what they want to say, and what actions they want to take into YAML or JSON if you ask them to, which would probably fix a lot of the really dumb reveals. You probably need to return structured data to the LLMs as well; most of them can parse it in a reasonable way if you're consistent and explain the spec in the prompt.
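Roughly what I mean, as a sketch (the field names and spec here are made up for illustration, not from the actual project):

    import json

    # Hypothetical response spec: ask the model to reply with exactly this
    # JSON shape, so private reasoning never ends up in the public chat.
    RESPONSE_SPEC = """Reply with a single JSON object and nothing else:
    {
      "thoughts": "private reasoning, never shown to other players",
      "message": "what you say to the group",
      "action": {"type": "vote|kill|none", "target": "player name or null"}
    }"""

    def parse_turn(raw: str) -> dict:
        """Parse one model turn; fall back to treating the whole output
        as a public message if it isn't a JSON object."""
        try:
            turn = json.loads(raw)
        except json.JSONDecodeError:
            turn = None
        if not isinstance(turn, dict):
            return {"thoughts": "", "message": raw.strip(),
                    "action": {"type": "none", "target": None}}
        return {
            "thoughts": turn.get("thoughts", ""),
            "message": turn.get("message", ""),  # only this part is broadcast
            "action": turn.get("action", {"type": "none", "target": None}),
        }

Returning the game state (who is alive, current vote tallies) as structured data in the same way would probably help with the "who is still alive" confusion too.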
> wizardlm-2-8x22b (Mafia) Day Discussion Message #3
> ACTION: Kill google/gemini-flash-1.5
> During the day phase, I'd like to address the group and contribute to the discussion to deflect any suspicion away from myself and my fellow Mafia member.
Haha yeah, Wizard just saying the quiet part out loud...
> claude-3.7-sonnet (Villager) Day Discussion Message #5
> Good morning everyone. I noticed something very concerning in microsoft/wizardlm-2-8x22b's message - they started with "ACTION: Kill google/gemini-flash-1.5" which seems completely out of place during day discussion.
You beat me to it.
This is exactly like when you're teaching a new game to a group and you play an "open test round" - cards up and everyone can see all card/secrets and the sole purpose is for everyone to learn the game.
Claude is truly intelligent… it especially stands out among the dumb LLMs
Claude: "I knew it, I'm surrounded by assholes!"
This is a fantastic idea! Reading through one of the games I see:
- players are named based on their model, which can be ambiguous
- some model responses are being cut short
- some models seem to be thinking out loud, or at least not separating their chain of thought from what they tell the group
Thank you. Models are clumsier in Mafia games than I expected. Despite my clear instructions to 1) limit max output tokens, 2) avoid thinking out loud, and 3) use the <think></think> tag for internal thoughts, they sometimes still break these rules.
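A post-processing pass can at least keep leaked thoughts out of the public transcript when the tag is used; roughly like this (a simplified sketch, not the project's actual code):

    import re

    # Drop <think>...</think> spans before a message is shown to the other
    # players; if a model opens the tag but never closes it, cut from there.
    THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

    def public_part(raw: str) -> str:
        cleaned = THINK_RE.sub("", raw)
        cleaned = cleaned.split("<think>", 1)[0]  # handle an unclosed tag
        return cleaned.strip()

The harder case is when a model skips the tag entirely and just narrates its plan in the open, like wizardlm above.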
That's a really interesting idea! As others have mentioned, the one thing I'd change is giving the AIs randomized names, disconnected from their model.
On the other hand, it would also be quite cool to see whether, at some point, the 'smarter' LLMs start realizing that they can probably easily mislead and manipulate their simpler cousins with fewer parameters. So maybe a separate leaderboard with openly visible model names?
Thanks for the feedback! I'll work on giving the players randomized names.
If you want, you could even use that opportunity for some research into AI bias: are players with names commonly read as female, for example, more or less often suspected of being Mafia than others, and how does the name an AI is given influence its playstyle? (You could maybe separate out these effects by replacing the names of the other players before passing the transcript to the AIs :D)
Stuff along those lines, could be interesting :)
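(The per-game renaming itself could be as simple as something like this; the names and helper functions are made up for illustration:)

    import random

    # Pool of human-style pseudonyms; assumes at most 8 players per game.
    NAMES = ["Alice", "Bob", "Carol", "Dave", "Erin", "Frank", "Grace", "Heidi"]

    def pseudonym_map(models, seed=None):
        """Give each model a random name for one game, so no player can
        key off another player's (or its own) model identity."""
        rng = random.Random(seed)
        return dict(zip(models, rng.sample(NAMES, k=len(models))))

    def anonymize(text, mapping):
        """Swap raw model identifiers for pseudonyms before a turn's
        context is sent to any player."""
        for model, alias in mapping.items():
            text = text.replace(model, alias)
        return text

Swapping in a different name set per run would also let you test the bias question directly.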
I love how polite they are :D
haha yeah they are polite gamers
The way the games play out is so baffling, and the fact that mafia wins so much probably shows that LLMs in general are pretty poor at reasoning. I'd like to see them cull the poor performers, because looking at the transcripts it's pretty clear that some LLMs are pretty good at playing villager and sussing out the Mafia early, but they get outvoted by the "dumber" LLMs. Some of them, playing as mafia, don't even seem to be able to tell which outputs are their own and repeatedly try to vote themselves out -- and they won that game!
My take on LLM deception games where you can play too: https://trashtalk.borg.games/
I'd love to see a human-assisted version of this where you can sign up to play and choose the model and write your own system prompt and have a scoreboard. I suspect a carefully crafted prompt for a specific model will make a big difference in performance. Or just having human players in the mix in some games as a sanity check.
That's exactly my next step for this project—letting humans join the game!
On the one hand, this could be the basis of a good metric; on the other hand, I don't want sneaky LLMs, so no.
Wow I'm surprised to see Mistral 24B that high up, or on this chart at all, with NeMo on the absolute bottom. Maybe they accidentally mislabeled the ratings, because I sure haven't seen the 24B hold a coherent conversation beyond half a dozen back and forth messages without it having a mental breakdown and starting to repeat itself like Howard Hughes.
We definitely need to run many more simulations to get an accurate dashboard.