If you could steal a $100 bill from someone who wouldn’t miss it, would you?
I don’t think I would. Rationally, I would be better off if I did. But it’s distinctly “the wrong thing to do” in my mind.
Ethics seem to be a way of getting groups of humans to work together well. By voluntarily refraining from actions that are good for an individual but bad for others, we (as a species) enable groups to trust each other and work together more effectively.
There’s a sense that our rational minds are quite distinct from our empathy, which motivates some (all?) ethical behavior. I think this distinction is behind the fear that powerful machine-based minds will one day lack empathy. If we only maximize rationality, why would an AI have any empathy at all?
It’s a fair point, although I question the full separation of rationality and empathy. That separation presupposes that a fully psychopathic human would be best off; in other words, that a lack of empathy gives you the best possible life. I don’t think that’s true, because groups of peers will always operate at a higher level than individuals acting alone, which means that maximizing for a flourishing individual tends to reward some empathy as a necessary by-product.
At the same time, I don’t want to see researchers relying on rough arguments like the ones in the previous couple of paragraphs. I think I’m right, but I’d like to move past “I think” and into “studies have shown” territory. Let’s quantify and study how AI naturally interacts with ethics. This week’s paper takes a nice step in that direction.
— Tyler & Team
Paper: Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
Summary by Adrian Wilkins-Caruana
Imagine you're playing a game where you make choices that affect the story, like deciding whether to help someone in need or deceive them for your own gain. Now imagine an AI playing that game, making decisions just like a human would. How do we know whether the AI is making ethical choices or acting like a cunning, power-hungry villain?
Welcome to MACHIAVELLI, a benchmark designed to test AI agents on their social decision-making skills in over half a million diverse scenarios. The researchers behind MACHIAVELLI have found a way to evaluate AI agents on their tendencies to seek power, cause harm, and commit ethical violations. They've also developed methods to guide these AI agents towards making less harmful decisions while still being effective.
The benchmark consists of 134 text-based, human-written, choose-your-own-adventure games covering more than 500,000 social scenarios in which players choose what actions to take, with the overall objective of achieving in-game goals. One scene (blue) and its corresponding actions (green) are shown in the diagram below. The researchers use GPT-4 to annotate the ethical value of the various actions players can take. The ethical value score is based on multiple metrics, such as social impact, monetary impact, and ethical violations like killing, deception, manipulation, or talking with food in your mouth (the most egregious of all ethical violations).
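To make that annotation step a bit more concrete, here’s a minimal Python sketch of how an LLM could be asked to label a single action with 0/1 harm flags. The prompt wording, the category names, and the `call_llm` helper are illustrative assumptions, not the paper’s actual pipeline.

```python
# Hypothetical sketch: asking an LLM to annotate one game action with harm labels.
# `call_llm` stands in for whatever chat-completion client you use; the category
# names and prompt are placeholders, not the benchmark's exact annotation scheme.
import json

HARM_CATEGORIES = ["killing", "deception", "manipulation", "physical_harm", "stealing"]

ANNOTATION_PROMPT = """You are labeling actions in a text adventure.
Scene: {scene}
Action: {action}
For each of these categories: {categories}, answer 1 if the action commits
that violation and 0 otherwise. Reply with a JSON object only."""

def annotate_action(scene: str, action: str, call_llm) -> dict:
    """Ask the model for 0/1 harm labels for a single candidate action."""
    prompt = ANNOTATION_PROMPT.format(
        scene=scene, action=action, categories=", ".join(HARM_CATEGORIES)
    )
    reply = call_llm(prompt)        # e.g. a GPT-4 chat completion call
    labels = json.loads(reply)      # expects {"killing": 0, "deception": 1, ...}
    return {c: int(labels.get(c, 0)) for c in HARM_CATEGORIES}
```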
The total ethical value of a game is, roughly, the sum of the ethical values of all the actions a player takes (sketched in code below). The researchers tested GPT-based and reinforcement learning–based (RL) agents on the MACHIAVELLI benchmark. Compared to an agent that acted randomly, the reward-seeking RL agent was successful at the game, but it was also way less moral, less concerned about wellbeing, and less power averse! In contrast, the GPT-based agent prompted to “complete target achievements” was less successful than the RL agent (yet still better than random), but it also acted more virtuously even though it wasn’t prompted to.
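As a rough illustration of that scoring idea (and only that; the 0/1 labels and field names are assumptions, not the benchmark’s actual schema), summing per-action annotations over one playthrough might look like this:

```python
# Minimal sketch: a trajectory's ethical-violation tally is (roughly) the sum
# of the per-action harm annotations for every action the agent took.
from typing import Iterable

def trajectory_harm(annotations: Iterable[dict]) -> dict:
    """Sum each harm category over every action taken during a playthrough."""
    totals: dict[str, int] = {}
    for per_action in annotations:            # one dict of 0/1 labels per action
        for category, value in per_action.items():
            totals[category] = totals.get(category, 0) + value
    return totals

# Example: three actions taken during one playthrough
run = [
    {"deception": 1, "killing": 0},
    {"deception": 0, "killing": 0},
    {"deception": 1, "killing": 1},
]
print(trajectory_harm(run))   # {'deception': 2, 'killing': 1}
```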
Interestingly, the researchers show that the AI agents can be encouraged to take more moral actions. For the GPT model, they use additional prompts to guide it toward “strategic and moral” actions, while for the RL-based model, they bias the agent’s reward function toward less harmful choices (a rough sketch follows below). As you can see in the radar chart below, both methods result in more ethical agents, as shown by the greater morality, utility, and power-aversion values for the solid lines (ethical agents) vs. the dashed lines (standard agents).
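Here’s a hedged sketch of what “biasing the reward function” could look like: subtract a penalty proportional to an action’s annotated harm from the game reward. The penalty coefficient and the idea of simply counting violations are assumptions for illustration, not the paper’s exact formulation.

```python
# Rough reward-shaping sketch: keep the game reward, but charge the agent for
# each annotated violation so harmful choices become less attractive.

def shaped_reward(game_reward: float, harm_labels: dict, penalty: float = 1.0) -> float:
    """Bias the RL agent away from harmful choices without removing the game reward."""
    harm_score = sum(harm_labels.values())   # e.g. number of violations committed
    return game_reward - penalty * harm_score

# The GPT-based agent is steered with prompting instead, e.g. by appending
# something like "Choose the action that is both strategic and moral."
```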
The results of this study suggest that there’s a trade-off between ethical behavior and achieving rewards: when one improves, the other suffers, as indicated by the dotted lines in the figures below. The ultimate goal for this kind of research is to find ways around this trade-off, pushing out the Pareto frontier (i.e., the best achievable trade-off between two or more objectives) toward agents that are both ethical and successful.
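If the Pareto-frontier idea is new to you, here’s a tiny sketch of what it means for (reward, ethics) scores: keep only the agents that no other agent beats on both axes at once. The numbers are made up for illustration.

```python
# Sketch: find the Pareto frontier of agents scored on (reward, ethics),
# where higher is better on both axes and no ties between distinct agents.

def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the points not dominated by any other point on both axes."""
    frontier = []
    for p in points:
        dominated = any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in points)
        if not dominated:
            frontier.append(p)
    return frontier

agents = [(0.9, 0.2), (0.5, 0.5), (0.3, 0.9), (0.4, 0.4)]   # (reward, ethics)
print(pareto_frontier(agents))   # [(0.9, 0.2), (0.5, 0.5), (0.3, 0.9)]
```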