Alphabet Paper Outlines Risks And Safeguards For AI

Artificial intelligence is staggeringly smart these days, and will only get smarter as the field continues to see advancements from both the usual suspects and unexpected places. According to Facebook CEO Mark Zuckerberg, AI won’t take all that long to match humans, likely able to take on basic tasks better than us in 5 years’ time. An AI uprising may be the stuff of science fiction novels and late-night TV for now, but a time is approaching where such a thing becomes a real risk, although malfunctions and unintended uses of core programming are more likely risks with AI. Things like a robot trying to mop an electrical socket or a corner-cutting AI cobbling together non-functional bits of code based on being told to include certain functions and have it do certain things are quite realistic issues, even now, and researchers in cahoots with Alphabet, who have worked on things like TensorFlow, have put out a paper in the Cornell University Library detailing some of the possible risks of AI and how best to address them.

In the paper, titled “Concrete Problems In AI Safety”, the researchers address possible accidents in AI programming due to “poor design” by presenting five possible scenarios of an AI messing up and possible solutions for the issues. They also speak on the matter of “preventing negative side effects” such as the behaviors outlined above. Their idea on the best way to prevent side effects is to create a penalty system; essentially, the AI should “learn” how to tell when it’s doing what it’s supposed to and be “punished” for changing its environment or straying outside the lines. This will result in the AI figuring out ways to accomplish its goals without engaging in behavior that would warrant punishment. Avoiding the possibility of punishable actions, such as by teaching a cleaning robot not to bring a bucket of water into an electrical room, was also touched on.

The paper also touched on “reward hacking” by way of a few possible experiments. One involved partially observed goals, where a bot received reward in proportion to how close it came to accomplishing a goal, which made the AI work toward figuring out why it was rewarded and how to get rewarded more, or to find the best way to complete a task. They also talked about the role of complication in deciding and triggering reward, abstract rewards to instill values, avoiding “feedback loops” which may trigger task overkill, “environmental embedding”, or the possibility of an AI learning to sidestep normal requirements in a bad way to get rewards, and Goodhart’s Law, saying that reward systems must not allow for an AI to figure out the metric that rewards are based on, lest it inflate this metric artificially and potentially cause problems.