Last year, computer scientists at the University of Montreal (U of M) in Canada were eager to show off a new speech recognition algorithm, and they wanted to compare it to a benchmark, an algorithm from a well-known scientist. The only problem: The benchmark’s source code wasn’t published. The researchers had to recreate it from the published description. But they couldn’t get their version to match the benchmark’s claimed performance, says Nan Rosemary Ke, a Ph.D. student in the U of M lab. “We tried for 2 months and we couldn’t get anywhere close.”
[…]
The most basic problem is that researchers often don’t share their source code. At the AAAI meeting, Odd Erik Gundersen, a computer scientist at the Norwegian University of Science and Technology in Trondheim, reported the results of a survey of 400 algorithms presented in papers at two top AI conferences in the past few years. He found that only 6% of the presenters shared the algorithm’s code. Only a third shared the data they tested their algorithms on, and just half shared “pseudocode”—a limited summary of an algorithm. (In many cases, code is also absent from AI papers published in journals, including Science and Nature.)
[…]
Assuming you can get and run the original code, it still might not do what you expect. In the area of AI called machine learning, in which computers derive expertise from experience, the training data for an algorithm can influence its performance. Ke suspects that not knowing the training data for the speech-recognition benchmark was what tripped up her group. “There’s randomness from one run to another,” she says. You can get “really, really lucky and have one run with a really good number,” she adds. “That’s usually what people report.”
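The run-to-run randomness Ke describes is easy to demonstrate. The sketch below is purely illustrative (the `train_and_evaluate` function is a hypothetical stand-in for a full training run, not anyone's actual benchmark): each seed yields a different score, and reporting only the best run inflates the result relative to the mean.

```python
import random
import statistics

def train_and_evaluate(seed):
    """Hypothetical stand-in for a full training run: the reported
    'accuracy' depends on the random seed, mimicking the run-to-run
    variance introduced by initialization and data shuffling."""
    rng = random.Random(seed)
    base_accuracy = 0.90
    return base_accuracy + rng.gauss(0, 0.02)  # noise around a true score

# Repeat the experiment with several seeds, as a careful paper would.
scores = [train_and_evaluate(seed) for seed in range(10)]

best = max(scores)               # the "lucky run" that often gets reported
mean = statistics.mean(scores)   # a fairer summary of the method
spread = statistics.stdev(scores)

print(f"best: {best:.3f}, mean: {mean:.3f} +/- {spread:.3f}")
```

The gap between `best` and `mean` is exactly the gap a replicating lab falls into when a paper reports a single lucky run.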
[…]
Henderson’s experiment was conducted in a test bed for reinforcement learning algorithms called Gym, created by OpenAI, a nonprofit based in San Francisco, California. John Schulman, a computer scientist at OpenAI who helped create Gym, says that it helps standardize experiments. “Before Gym, a lot of people were working on reinforcement learning, but everyone kind of cooked up their own environments for their experiments, and that made it hard to compare results across papers,” he says.
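What Gym standardized is the interface: every environment exposes the same reset-then-step loop, so any agent can be compared on any environment. The toy environment below is not a Gym environment, only an illustration of that convention (the class, its dynamics, and the reward are all invented for this sketch):

```python
import random

class ToyEnv:
    """Illustrative environment following the reset()/step() convention
    that Gym popularized; the dynamics here are made up."""

    def reset(self):
        self.position = 0
        return self.position  # initial observation

    def step(self, action):
        # action is -1 or +1; the episode ends when |position| reaches 3,
        # with reward 1.0 only for reaching +3.
        self.position += action
        observation = self.position
        reward = 1.0 if self.position == 3 else 0.0
        done = abs(self.position) >= 3
        return observation, reward, done, {}

# Any agent can interact with any such environment through the same loop.
env = ToyEnv()
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = random.choice([-1, 1])          # a random-policy "agent"
    obs, reward, done, info = env.step(action)
    total_reward += reward
```

Because every environment speaks this protocol, swapping in a different task or a different agent changes one line, not the whole experiment, which is what makes results comparable across papers.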

IBM Research presented another tool at the AAAI meeting to aid replication: a system for recreating unpublished source code automatically, saving researchers days or weeks of effort. It’s a neural network—a machine learning algorithm made of layers of small computational units, analogous to neurons—that is designed to recreate other neural networks. It scans an AI research paper looking for a chart or diagram describing a neural net, parses those data into layers and connections, and generates the network in new code. The tool has now reproduced hundreds of published neural networks, and IBM is planning to make them available in an open, online repository.
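IBM's system itself is not public here, but its final step, turning a parsed diagram into runnable model code, can be sketched. Everything below is an assumption for illustration: the `parsed_layers` structure stands in for whatever the tool extracts from a figure, and the generated lines target a generic Keras-style `Sequential` API rather than IBM's actual output format.

```python
# Hypothetical output of the diagram-parsing step: an ordered list of
# layer descriptions recovered from a figure in a paper.
parsed_layers = [
    {"type": "input", "units": 784},
    {"type": "dense", "units": 128},
    {"type": "dense", "units": 10},
]

def generate_network_code(layers):
    """Turn a parsed layer list into model-building source code
    (Keras-style Sequential API, chosen here only as an example target)."""
    lines = ["model = Sequential()"]
    input_units = layers[0]["units"]
    for i, layer in enumerate(layers[1:]):
        if i == 0:
            # The first hidden layer also declares the input size.
            lines.append(
                f"model.add(Dense({layer['units']}, input_dim={input_units}))"
            )
        else:
            lines.append(f"model.add(Dense({layer['units']}))")
    return "\n".join(lines)

print(generate_network_code(parsed_layers))
```

The hard part of the real system is of course the parsing, reading layer counts and connections out of heterogeneous charts, but once a structured description exists, code generation reduces to a mechanical translation like this one.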

Source: Missing data hinder replication of artificial intelligence studies | Science | AAAS