I want to talk about a recent conversation on the Open Source AI definition, but before that I want to make an acknowledgement. My position on the uptake of “AI” is that it is morally unconscionable, short-sighted, and frankly, just stupid. In a time of snowballing climate crisis and impending environmental doom, not only are we diverting limited resources away from climate justice, we’re routing them toward contributing to the crisis.
Not only that, the utility and societal relevance of LLMs and neural networks have been vastly overstated. They perform consistently worse than traditional computing and than people doing the same jobs, yet they are advertised as replacements for jobs and professions that don’t need replacing. Furthermore, we’ve been assaulted with a PR campaign of highly polished, plagiarizing mechanical turks that hide the human labor involved and shift the costs in a way that furthers wealth inequality, and we’ve been promised that they will only get better (are they? And better for whom?)
However, since the world seems to have lost the plot, and until all the data centers are under sea water, some of us have to engage with “AI” seriously, whether to do some unintentional whitewashing under the illusion of driving the conversation, or for much-needed harm reduction work, or simply for good old-fashioned opportunism.
The modern tale of machine learning is intertwined with openwashing, where companies try to mislead consumers by associating their products with open source without actually being open or transparent. Within that context, and as legislation comes for “AI”, it makes sense that an organization like the Open Source Initiative (OSI) would try to establish a definition of what constitutes Open Source “AI”. It’s certainly not an easy task to take on.
The conversation that I would like to bring to your attention was started by Julia Ferraioli in this thread (noting that the thread got a bit large, so the weekly summaries posted by Mia Lykou Lund might be easier to follow). Julia argues that a definition of Open Source “AI” that doesn’t include the data used for training the model cannot be considered open source. The current draft lists those data as optional.
Stefano Maffulli published an opinion explaining the side of the proponents of keeping training data optional. I’ve tried to stay abreast of the conversations, but there have been a lot of takes across a lot of platforms, so I will limit my take to that recently published piece.
Reading through it, I’m personally not convinced and fully support the position that Julia outlined in the original thread. I don’t dismiss the concerns that Stefano raised wholesale, but ultimately they are not compelling. Fragmented global data regulations and compliance are not a challenge unique to Open Source “AI”, and they should be addressed at that level to enable openness on a global scale.
Fundamentally, it comes down to this: Stefano argues that this open data requirement would put “Open Source at a disadvantage compared to opaque and proprietary AI systems.” Well, if the price of making Open Source “AI” competitive with proprietary “AI” is to break the openness that is fundamental to the definition, then why are we doing it? Is this about protecting Open Source from openwashing, or accidentally enabling it because the right thing is hard to do? And when has Open Source not been at a disadvantage to proprietary systems?
I understand that OSI is navigating a complicated topic and trying to come up with an alternative that pleases everyone, but the longer this conversation goes on, the clearer it becomes that at some point a line needs to be drawn, and OSI has to decide which side of that line it wants to be on.
EDIT (June 15th, 17:20 CET): I may be a bit behind on this, but I just read a post by Tom Callaway from two weeks ago that makes many of the same points much more eloquently and goes deeper into them. I highly recommend reading it.