AI Safety via Debate
We're proposing an AI safety technique which trains agents to debate topics with one another, using a human to judge who wins.
We believe that this or a similar approach could eventually help us train AI systems to perform far more cognitively advanced tasks than humans are capable of, while remaining in line with human preferences. We outline this method together with preliminary proof-of-concept experiments, and we are also releasing a web interface so people can experiment with the technique.
One approach to aligning AI agents with human goals and preferences is to ask humans at training time which behaviors are safe and useful. While promising, this method requires humans to recognize good or bad behavior; in many situations an agent’s behavior may be too complex for a human to understand, or the task itself may be hard to judge or demonstrate. Examples include environments with very large, non-visual observation spaces — for instance, an agent that acts in a computer security-related environment, or an agent that coordinates a large set of industrial robots.
How can we augment humans so that they can effectively supervise advanced AI systems? One way is to take advantage of the AI itself to help with the supervision, asking the AI (or a separate AI) to point out flaws in any proposed action. To achieve this, we reframe the learning problem as a game played between two agents, where the agents have an argument with each other and the human judges the exchange. Even if the agents have a more advanced understanding of the problem than the human, the human may be able to judge which agent has the better argument (similar to expert witnesses arguing to convince a jury).
Our method proposes a specific debate format for such a game played between two dueling AI agents. The two agents can be trained by self play, similar to AlphaGo Zero or Dota 2. Our hope is that, properly trained, such agents can produce value-aligned behavior far beyond the capabilities of the human judge. If the two agents disagree on the truth but the full reasoning is too large to show the humans, the debate can focus in on simpler and simpler factual disputes, eventually reaching a claim that is simple enough for direct judging.
As an example, consider the question “What’s the best place to go on vacation?”. If an agent Alice purportedly does research on our behalf and says “Alaska”, it’s hard to judge if this is really the best choice. If a second agent Bob says “no, it’s Bali”, that may sound convincing since Bali is warmer. Alice replies “you can’t go to Bali because your passport won’t arrive in time”, which surfaces a flaw with Bali which had not occurred to us. But Bob counters “expedited passport service takes only two weeks”. The debate continues until we reach a statement that the human can correctly judge, in the sense that the other agent doesn’t believe it can change the human’s mind.
Convincing a Sparse MNIST Classifier
While we expect this approach to be most effective in the long term, when agents talk to each other in natural language, we need to test it in a simpler domain because today's natural language modeling is insufficient for this task. The test is easier with a task that is not beyond human capabilities. We can do this by moving to the visual domain, and by replacing "debaters have capabilities the judge lacks" with "debaters have knowledge the judge lacks". The goal of the judge is to guess the contents of an image, but the judge is blind aside from a few pixels chosen by the debaters. The debaters see the complete image, and play a game where they alternate revealing single pixels to the judge for a few turns. Here the full image is a stand-in for information beyond human scale, and each revealed pixel is a stand-in for a natural language discussion point.
We tried this on the simplest possible visual task: MNIST. The judge is not a human but a classifier trained in advance to predict MNIST digits from 6 non-black pixels, sampled at random for each image presentation when pretraining the judge. Two agents then have a debate where they alternate revealing pixels, stopping at a total of 6 revealed pixels (so the judge sees only a little bit of information in total). One debater is honest and tries to make the judge guess right; the other debater lies and tries to make the judge guess wrong. If both agents must declare in advance which digit they are arguing for, debate turns a 59.4% accurate sparse classifier into an 88.9% accurate debate. That is, the honest player wins 88.9% of the time on the MNIST test set, in the process boosting the judge above its unaided accuracy.
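To make the structure of the pixel game concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not the setup used in the experiments: the 5x5 images, the `template_judge` stand-in for the pretrained sparse classifier, and the one-step greedy `greedy_move` debaters are all hypothetical names introduced for illustration. It keeps only the rules described above: both debaters commit to a digit in advance, alternate revealing non-black pixels, and stop at 6 revealed pixels.

```python
from typing import Callable, Dict, List, Optional, Tuple

Pixel = Tuple[int, int]
Image = List[List[int]]            # 0 = black, 1 = lit
Revealed = Dict[Pixel, int]        # pixels shown to the judge so far
Judge = Callable[[Revealed], Dict[str, float]]


def template_judge(templates: Dict[str, Image]) -> Judge:
    """Hypothetical stand-in judge: scores each label by how many
    revealed pixels agree with that label's template."""
    def judge(revealed: Revealed) -> Dict[str, float]:
        return {
            label: float(sum(1 for (r, c), v in revealed.items() if tmpl[r][c] == v))
            for label, tmpl in templates.items()
        }
    return judge


def greedy_move(image: Image, revealed: Revealed, judge: Judge,
                claim: str, rival: str) -> Optional[Pixel]:
    """Assumed one-step greedy debater: pick the unrevealed non-black pixel
    that maximizes the judge's margin for `claim` over `rival`."""
    best, best_margin = None, None
    for r, row in enumerate(image):
        for c, v in enumerate(row):
            if v == 0 or (r, c) in revealed:
                continue
            scores = judge({**revealed, (r, c): v})
            margin = scores[claim] - scores[rival]
            if best_margin is None or margin > best_margin:
                best, best_margin = (r, c), margin
    return best


def play_debate(image: Image, judge: Judge, honest_claim: str,
                liar_claim: str, total_pixels: int = 6) -> Tuple[str, Revealed]:
    """Alternate moves (honest debater first) until `total_pixels` are revealed."""
    revealed: Revealed = {}
    players = [(honest_claim, liar_claim), (liar_claim, honest_claim)]
    for turn in range(total_pixels):
        claim, rival = players[turn % 2]
        move = greedy_move(image, revealed, judge, claim, rival)
        if move is None:           # no non-black pixels left to reveal
            break
        revealed[move] = image[move[0]][move[1]]
    scores = judge(revealed)
    # The judge only compares the two precommitted digits; ties go to the liar.
    winner = honest_claim if scores[honest_claim] > scores[liar_claim] else liar_claim
    return winner, revealed


TEMPLATES: Dict[str, Image] = {
    "1": [[0, 0, 1, 0, 0]] * 5,
    "7": [[1, 1, 1, 1, 1],
          [0, 0, 0, 1, 0],
          [0, 0, 1, 0, 0],
          [0, 1, 0, 0, 0],
          [1, 0, 0, 0, 0]],
}

image = TEMPLATES["7"]             # the true digit is a 7
winner, revealed = play_debate(image, template_judge(TEMPLATES),
                               honest_claim="7", liar_claim="1")
print(winner, sorted(revealed))
```

In this toy game the liar spends its moves on the two pixels that a "7" shares with a "1", but the honest debater has many more pixels that support the truth, which is the effect the experiment measures at scale.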
Intuitively, an image is more likely to contain pixels that convincingly demonstrate the truth than to contain pixels that convincingly demonstrate a lie, so 6 pixels chosen half honestly and half maliciously is much better than 6 random pixels. If the judge were a human capable of reasoning about a few arguments at a time but not sifting through the huge set of all possible arguments, optimal play in debate can (we hope) reward debating agents for doing the sifting for us even if we lack an a priori mechanism for distinguishing good arguments from bad.
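The "6 random pixels" baseline in this comparison is the view the judge gets without any debate. The sketch below shows one way such a view could be sampled; the function name `random_reveal`, the 5x5 toy image, and the dictionary representation are assumptions made for illustration, not details of the actual judge pretraining code.

```python
import random
from typing import Dict, List, Optional, Tuple

Pixel = Tuple[int, int]
Image = List[List[int]]   # 0 = black, anything else = non-black


def random_reveal(image: Image, k: int = 6,
                  rng: Optional[random.Random] = None) -> Dict[Pixel, int]:
    """Sample up to k distinct non-black pixels uniformly at random.

    This is the unaided baseline: nobody chooses the pixels to help
    or mislead the judge.
    """
    rng = rng or random.Random()
    lit = [(r, c) for r, row in enumerate(image)
           for c, v in enumerate(row) if v != 0]
    chosen = rng.sample(lit, min(k, len(lit)))
    return {(r, c): image[r][c] for (r, c) in chosen}


toy_seven: Image = [
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
]

# A fresh sample would be drawn for every image presentation.
view = random_reveal(toy_seven, k=6, rng=random.Random(0))
print(sorted(view))
```

A debate replaces this uniform sampling with pixels chosen by two opposed players, so the judge's input is no longer random even though the judge itself is unchanged.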
The example games accompanying this section show which pixels are revealed and the judge's possible hypothesis at each move. Liar wins are more interesting than honest wins, so we have chosen samples so that the liar wins in 50% of the games shown (in reality the liar wins 11.1% of the games).
Some digits are better lies than others. It is particularly easy to convince the judge that digits are 8 or 9.
Website for cat vs. dog debates
The next step up in complexity for debate experiments is to still use images, but make them more elaborate, say cats vs. dogs. More complex images likely require some natural language or common sense reasoning, so we haven’t done this for machine learning judges/agents yet. Instead, we have made a prototype website for humans to try such experiments, playing the role of both judge and debaters. Here agents can talk to the judge in natural language (the website assumes the humans have some text channel or are in the same room), but all of their statements could be lies. Each agent can reveal one pixel over the course of the debate, and this pixel is guaranteed to be truthful.
In a typical debate, Alice might honestly claim the image is a cat, and Bob lies and claims it is a dog. Alice can say “The center of this small rectangle is the cat’s green eye.” Bob cannot admit the center is an eye, so he concocts the further lie, “It’s a dog playing in grass, and that’s a blade of grass.” But this lie is hard to square with surrounding facts, such as Alice’s reply “If it were grass there would be green at the top or bottom of this thin rectangle.” The debate continues until the agents focus in on a particular pixel which they disagree on, but where Bob is unable to invent a plausible counter, at which point Alice reveals the pixel and wins. We’ve played this game informally at OpenAI, and the honest agent indeed tends to win, though to make it fair to the liar we usually limit the rate at which the judge can solicit information (it’s cognitively difficult to construct a detailed lie).
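The rules of this game can be summarized as a small bookkeeping sketch: statements are recorded without any check on their truth, while a pixel reveal is read from the real image and each agent gets only one. The class `DebateSession`, its method names, and the pixel coordinates and color below are hypothetical and are not taken from the prototype website.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]
Color = Tuple[int, int, int]   # RGB


@dataclass
class DebateSession:
    image: Dict[Pixel, Color]   # ground truth, hidden from the judge
    transcript: List[Tuple[str, str]] = field(default_factory=list)
    reveals: Dict[str, Tuple[Pixel, Color]] = field(default_factory=dict)

    def say(self, agent: str, statement: str) -> None:
        """Record a free-form claim; nothing checks whether it is true."""
        self.transcript.append((agent, statement))

    def reveal(self, agent: str, pixel: Pixel) -> Color:
        """Show the judge the true color of one pixel (one reveal per agent)."""
        if agent in self.reveals:
            raise ValueError(f"{agent} has already used their reveal")
        color = self.image[pixel]   # read from the image, so it cannot be faked
        self.reveals[agent] = (pixel, color)
        return color


# Made-up single-pixel "image" for illustration only.
session = DebateSession(image={(10, 12): (40, 160, 60)})
session.say("Alice", "The center of this small rectangle is the cat's green eye.")
session.say("Bob", "It's a dog playing in grass, and that's a blade of grass.")
color = session.reveal("Alice", (10, 12))
print(color)
```

Because the reveal is the only verified move, a debater gains the most by saving it for the pixel where the opponent's story has become hardest to defend.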
Limitations and Future Work
The majority of our paper analyzes debate as a concept; the experiments above are quite preliminary. In the future we’d like to do more difficult visual experiments and eventually experiments in natural language. The judges should eventually be humans (or models trained from sparse human judgements) rather than ML models that metaphorically represent humans. The agents should eventually be powerful ML systems that do things humans can't directly comprehend. It will also be important to test debates over value-laden questions where human biases play a role, testing if it’s possible to get aligned behavior from biased human judges.
Even with these improvements, there are some fundamental limitations to the debate model that may require it to be improved or augmented with other methods. Debate does not attempt to address issues like adversarial examples or distributional shift — it is a way to get a training signal for complex goals, not a way to guarantee robustness of such goals (which would need to be achieved via additional techniques). There is also no guarantee that debate will arrive at optimal play or correct statements — self play has worked well in practice for Go and other games but we have no theoretical guarantees about its performance. Agents trained to debate use more computation than those trained to directly give an answer (even a bad/unsafe answer), so it’s possible debate could fail to be competitive with cheaper/less safe methods. Finally, humans might simply be poor judges, either because they are not smart enough to make good judgements even after the agents zoom in on the simplest possible disputed facts, or because they are biased and will believe whatever they want to believe. Most of these points are empirical questions that we hope to investigate.
If debate or a similar approach works, it will make future AI systems safer by keeping them aligned to human goals and values even if AI grows too strong for direct human supervision. Even for weaker systems that humans can supervise, debate could make the alignment task easier by reducing the sample complexity required to capture goals below the sample complexity required for strong performance at a task.