The Open Source Initiative (OSI) today released version 1.0 of its Open Source AI Definition to clarify what constitutes open source AI. This gives the industry a standard by which to validate whether an AI system can be deemed open source AI.
The definition covers code, model, and data information, with the latter being a contentious point due to legal and practical concerns. Mozilla, a long-time open source advocate, is partnering with OSI to promote openness in AI, advocating for transparency in AI systems.
Understanding how AI systems work, so they can be researched, scrutinized and potentially regulated, is essential to ensuring a system is truly open source. Ayah Bdeir, senior advisor on AI strategy at Mozilla, told SD Times on the “What the Dev?” podcast that AI systems are influenced by a number of different components: algorithms, code, hardware, data sets and more.
As an example, she cited that there are data sets to train models, data sets to test them, and data sets to fine tune them; releasing only some of these components creates a false sense of transparency that leads organizations to claim their systems are open source. She contrasted this with traditional open source software, where “there’s a very clear separation between code that is written, a compiler that is used, and a license that is possessed. Each one of them can have an open license or a closed license and it’s very clear how each one of them applies to this concept of openness.”
However, in AI systems, many components influence the system, Bdeir said. “There are algorithms, there’s code, there’s hardware, there are data sets. There’s a data set to train, there’s a data set to test, there’s a data set to fine tune, and sort of this idea that if the code is open, that means their AI systems are open, which is not accurate.” This does not allow the fundamental reuse or study of the system that an open source mentality requires under its four freedoms: use, study, modify and share, she explained.
“The open source AI definition by OSI is an attempt to put a real fine point on what open source AI is and isn’t, and how to have a checklist that checks for whether something is or isn’t, so that this ambiguity between claiming that something is open source and it actually being so is not there anymore,” she said.
The debate over data information was among the most controversial in coming up with the definition, Bdeir said. How do organizations that are training their models with proprietary data protect it from being used in open source AI? Bdeir explained there are two schools of thought around data in particular. In one school of thought, the data set must be made completely open and available in its exact form for an AI system to be considered open source. “Otherwise,” she said, “you cannot replicate this AI system. You cannot look at the data itself to see what it was trained on, or what it was fine tuned on, etc. And therefore it’s not really open source.”
In another school of thought, where she said some of the more hands-on builders reside, making the data available is not realistic. “Data is governed by laws that are different in different countries. Copyright laws are different in different countries, and licenses on data are not always super clear and easy to find, and if you inadvertently or mistakenly distribute data sets that you have no rights to, you are liable legally.”
OSI’s solution to this problem is to require data information rather than the data set itself. The wording, Bdeir said, says the organization must provide “sufficiently detailed information about the data used to train the system so that a skilled person can recreate a substantially equivalent system using the same or similar data.”