Ultimately, AI systems are only useful and safe as long as the goals they’ve learned actually mesh with what humans want them to do, and it can often be hard to know if they’ve subtly learned to solve the wrong problems or make bad decisions in certain conditions.
To make AI easier for humans to understand and trust, researchers at the nonprofit research organization OpenAI have proposed training algorithms to not only classify data or make decisions, but to justify their decisions in debates with other AI programs in front of a human or AI judge.
“Given a question or proposed action, two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information,” write OpenAI researchers Geoffrey Irving, Paul Christiano and Dario Amodei in a new research paper. The San Francisco-based AI lab is funded by Silicon Valley luminaries including Y Combinator President Sam Altman and Tesla CEO Elon Musk, with a goal of building safe, useful AI to benefit humanity.
Since human time is valuable and usually limited, the researchers say the AI systems can effectively train themselves in part by debating in front of an AI judge designed to mimic human decision making, similar to how software that plays games like Go or chess often trains in part by playing against itself.
In an experiment described in their paper, the researchers set up a debate where two software agents work with a standard set of handwritten numerals, attempting to convince an automated judge that a particular image is one digit rather than another digit, by taking turns revealing one pixel of the digit at a time. One bot is programmed to tell the truth, while another is programmed to lie about what number is in the image, and they reveal pixels to support their contentions that the digit is, say, a five rather than a six.
The truth-telling bots tend to reveal pixels from distinctive parts of the digit, like the horizontal line at the top of the numeral “5,” while the lying bots, in an attempt to deceive the judge, point out what amount to the most ambiguous areas, like the curve at the bottom of both a “5” and a “6.” The judge ultimately “guesses” which bot is telling the truth based on the pixels that have been revealed.The image classification task, where most of the image is invisible to the judge, is a sort of stand-in for complex problems where it wouldn’t be possible for a human judge to analyze the entire dataset to judge bot performance. The judge would have to rely on the facets of the data highlighted by debating robots, the researchers say.
“The goal here is to model situations where we have something that’s beyond human scale,” says Irving, a member of the AI safety team at OpenAI. “The best we can do there is replace something a human couldn’t possibly do with something a human can’t do because they’re not seeing an image.”
To test their hypothesis—that two debaters can lead to honest behavior even if the debaters know much more than the judge—the researchers have also devised an interactive demonstration of their approach, played entirely by humans and now available online. In the game, two human players are shown an image of either a dog or a cat and argue before a judge as to which species is represented. The contestants are allowed to highlight rectangular sections of the image to make their arguments—pointing out, for instance, a dog’s ears or cat’s paws—but the judge can “see” only the shapes and positions of the rectangles, not the actual image. While the honest player is required to tell the truth about what animal is shown, he or she is allowed to tell other lies in the course of the debate. “It is an interesting question whether lies by the honest player are useful,” the researchers write.
The researchers emphasize that it’s still early days, and the debate-based method still requires plenty of testing before AI developers will know exactly when it’s an effective strategy or how best to implement it. For instance, they may find that it may be better to use single judges or a panel of voting judges, or that some people are better equipped to judge certain debates.
It also remains to be seen whether humans will be accurate judges of sophisticated robots working on more sophisticated problems. People might be biased to rule in a certain way based on their own beliefs, and there could be problems that are hard to reduce enough to have a simple debate about, like the soundness of a mathematical proof, the researchers write.
Other less subtle errors may be easier to spot, like the sheep that Shane noticed had been erroneously labeled by Microsoft’s algorithms. “The agent would claim there’s sheep and point to the nonexistent sheep, and the human would say no,” Irving writes in an email to Fast Company.
But deceitful bots might also learn to appeal to human judges in sophisticated ways that don’t involve offering rigorous arguments, Shane suggested. “I wonder if we’d get kind of demagogue algorithms that would learn to exploit human emotions to argue their point,” she says.