Benchmarks are as important a measure of progress in AI as they are for the rest of the software industry. But when benchmark results come from corporations, secrecy very often prevents the community from verifying them.
For example, OpenAI granted Microsoft, with which it has a commercial relationship, exclusive licensing rights to its powerful language model GPT-3. Other organizations say that the code they use to develop systems depends on internal tooling and infrastructure that cannot be released, or uses copyrighted datasets. While the motivation can be ethical in nature – OpenAI initially declined to release GPT-2, the predecessor of GPT-3, for fear that it might be misused – the effect is the same. Without the necessary code, it is much harder for outside researchers to verify an organization’s claims.
“This isn’t really a sufficient alternative to good industry open source practices,” Columbia computer science Ph.D. candidate Gustaf Ahdritz told TechCrunch by email. Ahdritz is one of the lead developers of OpenFold, an open source version of DeepMind’s protein structure prediction system AlphaFold 2. “It’s hard to do all the science you might want to do with the code that DeepMind did release.”
Some researchers go so far as to say that withholding a system’s code “undermines its scientific value.” In October 2020, a rebuttal published in the journal Nature took issue with a cancer prediction system developed by Google Health, the branch of Google focused on health-related research. The co-authors noted that Google withheld key technical details, including a description of how the system was developed, which could significantly affect its performance.
Rather than wait for change, some members of the AI community, such as Ahdritz, have made it their mission to open source the systems themselves. Working from technical papers, these researchers painstakingly try to recreate the systems, either from scratch or by building on fragments of publicly available specifications.
OpenFold is one such effort. Launched shortly after DeepMind announced AlphaFold 2, it aims to verify that AlphaFold 2 can be reproduced from scratch and to make available components of the system that may be useful elsewhere, according to Ahdritz.
“We trust that DeepMind has provided all the necessary details, but … we don’t have [concrete] proof of this, and therefore this effort is key to providing that trail and allowing others to build on it,” Ahdritz said. “Moreover, some AlphaFold components were initially under a non-commercial license. Our components and data – DeepMind has yet to publish its complete training data – will be completely open source, enabling industry adoption.”
OpenFold is not the only project of its kind. Elsewhere, loosely affiliated groups in the AI community are attempting implementations of OpenAI’s code-generating Codex and art-creating DALL-E, DeepMind’s chess-playing AlphaZero, and even AlphaStar, a DeepMind system designed to play the real-time strategy game StarCraft 2. Among the more successful are EleutherAI and AI startup Hugging Face’s BigScience, open research efforts that aim to deliver the code and datasets needed to run a model comparable (though not identical) to GPT-3.
Philip Wang, a prolific member of the AI community who maintains a number of open source implementations on GitHub, including one of OpenAI’s DALL-E, argues that open-sourcing these systems reduces the need for researchers to duplicate their efforts.
“We read the latest AI research like any other researcher in the world. But instead of replicating the paper in a silo, we implement it as open source,” Wang said. “We are in an interesting place at the intersection of computer science and industry. I think open source is not one-sided and ultimately benefits everyone. It also appeals to a broader vision of a truly democratized AI, not beholden to shareholders.”
Brian Lee and Andrew Jackson, two Google employees, worked together to create MiniGo, a replication of AlphaZero. Although Lee and Jackson are not affiliated with the official project, being at Google, the original parent company of DeepMind, gave them the advantage of access to certain proprietary resources.
“[Working backward from papers is] like navigating before we had GPS,” Lee, a Google Brain research engineer, said in an email to TechCrunch. “The instructions talk about landmarks that you ought to see, how long you ought to go in a certain direction, which fork to take at a critical moment. There is enough detail for an experienced navigator to find the way, but if you don’t know how to read a compass, you will be hopelessly lost. You won’t retrace the steps exactly, but you will end up in the same place.”
The developers behind these initiatives, including Ahdritz and Jackson, say the projects will not only help demonstrate whether the systems work as advertised, but will also enable new applications and better hardware support. Systems from large labs and companies such as DeepMind, OpenAI, Microsoft, Amazon, and Meta are typically trained on expensive, proprietary data center servers with far more computing power than a conventional workstation, which adds to the hurdles of open-sourcing them.
“Training new variants of AlphaFold could lead to new applications beyond protein structure prediction, which is not possible with DeepMind’s original code release because it lacked the training code – for example, predicting how drugs bind proteins, how proteins move, and how proteins interact with other biomolecules,” Ahdritz said. “There are dozens of high-impact applications that require training new AlphaFold variants or integrating parts of AlphaFold into larger models, but the lack of training code prevents all of them.”
“These open source efforts do a lot to spread the ‘working knowledge’ about how these systems can behave in non-academic settings,” Jackson added. “The amount of compute required to reproduce the original results [for AlphaZero] is quite high. I don’t remember the number off the top of my head, but it involved running about a thousand GPUs for a week. We were in a rather unique position to help the community try out these models with our early access to the Google Cloud Platform’s TPU product, which had not yet been made publicly available.”
Implementing proprietary systems in open source is fraught with challenges, especially when public information is scarce. Ideally, the code is available along with the dataset used to train the system and the so-called weights, which are responsible for transforming the data fed into the system into predictions. But that is not often the case.
For example, in developing OpenFold, Ahdritz and his team had to gather information from official materials and reconcile differences between sources, including the source code, supplemental code, and presentations that DeepMind researchers gave early on. Ambiguities in steps such as data preparation and training code led to false starts, while a lack of hardware resources necessitated design trade-offs.
“We only get a handful of tries to get this right, lest it drag on indefinitely. These things have so many computationally intensive stages that a tiny bug can set us back greatly, so much so that we had to retrain the model and also regenerate a lot of training data,” Ahdritz said. “Some technical details that work very well for [DeepMind] don’t work as easily for us because we have different hardware … Also, the ambiguity about which details are critically important and which ones were chosen without much thought makes it hard to optimize or tweak anything, and locks us into whatever (sometimes awkward) choices were made in the original system.”
So, do the labs behind proprietary systems, such as OpenAI, care that their work is being reverse-engineered and even used by startups to launch competing services? Evidently not. Ahdritz says the fact that DeepMind, in particular, publishes so much detail about its systems suggests that it implicitly endorses these efforts, even if it has not said so publicly.
“We have not received any clear indication that DeepMind disapproves or approves of this effort,” Ahdritz said. “But certainly, no one has tried to stop us.”