The ugly truth about ‘open’ A.I. models championed by Meta, Google, and other Big Tech players

By Sage Lazzaro, Contributing writer

    Sage Lazzaro is a technology writer and editor focused on artificial intelligence, data, cloud, digital culture, and technology’s impact on our society and culture.

    Signal President Meredith Whittaker is one of the authors of a hard-hitting new paper on open source A.I.
    Piaras Ó Mídheach—Sportsfile for Web Summit Rio/Getty Images

    Hello and welcome back to Eye on A.I. In a paper published this past week, researchers from Carnegie Mellon University and the AI Now Institute, along with Signal Foundation President Meredith Whittaker, dove deep into what exactly is—and is not—open about current “open” A.I. systems. 

    From Meta’s LLaMA-2 to OpenAI’s various models, many of the A.I. technologies being released are touted by their corporate creators as “open” or “open source,” but the authors argue many of them aren’t so open after all and that these terms are used in confusing and diverse ways that have more to do with aspiration and marketing than with technical accuracy. The authors also interrogate how, due to the vast differences between large A.I. systems and traditional software, even the most maximally “open” A.I. offerings do not ensure a level playing field or facilitate the democratization of A.I.; in fact, large companies have a clear playbook for using their open A.I. offerings to leverage the benefits of owning the ecosystem and capture the industry.

    “Over the past months, we’ve seen a wave of A.I. systems described as ‘open’ in an attempt at branding, even though the authors and stewards of these systems provide little meaningful access or transparency about the system,” the authors told Eye on A.I., adding that these companies claim “openness” while not disclosing key features of their A.I. systems—from model size and training weights to basic information about the training data used.

    The paper comes amid a growing conversation about the reality of open-source in the A.I. world, from recent opinion pieces calling out supposedly open-source A.I. systems for not actually being so, to backlash from Hugging Face users who were disappointed when the license for one of the company’s open-source projects was changed after the fact. 

    In the paper, the researchers break down their findings by category, including development frameworks, compute, data, labor, and models. Looking at LLaMA-2 as one example, the authors call Meta’s claims that the model is open-source “contested, shallow, and borderline dishonest,” pointing to how it fails to meet key criteria that would allow it to be conventionally considered open source, such as the fact that its license was written by Meta and isn’t recognized by the Open Source Initiative.

    The discussion around how mass utilization of corporate giants’ A.I. systems further entrenches their ownership over the entire landscape—in turn, chipping away at openness and giving them immense indirect power—is a crucial point of the paper. In evaluating Meta’s PyTorch and Google’s TensorFlow, the two dominant A.I. development frameworks, the authors note that these frameworks do speed up the deployment process for those who use them, but to the massive benefit of Meta and Google.

    “Most significantly, they allow Meta, Google, and those steering framework development to standardize AI construction so it’s compatible with their own company platforms—ensuring that their framework leads developers to create AI systems that, Lego-like, snap into place with their own company systems,” reads the paper. The authors continue that this enables these companies to create onramps for profitable compute offerings and also shapes the work of researchers and developers.

    The takeaway is that, in A.I., labels like “open source” are not necessarily fact but rather language chosen by executives at powerful companies whose goals are to proliferate their technologies, capture the market, and boost their revenue.

    And the stakes are high as these companies integrate A.I. into more of our world and governments rush to regulate them. In addition to the recent proliferation of not-so-open “open” A.I. efforts, the authors said it was lobbying by these companies that prompted them to undertake this research.

    “What really set things off was observing the significant level of lobbying coming from industry players—like the Business Software Alliance, Google, and Microsoft’s GitHub—to seek exemption under the EU AI Act,” the authors said. “This was curious, given that these were the same companies that would, according to much of the rhetoric espousing ‘open’ AI’s benefits, be ‘disrupted’ were ‘open’ AI to proliferate.”

    Overall, it’s not just about the muddiness and lack of definition around terms like “open” and “open source,” but rather how they’re being used (or misused) by companies and how that usage can influence the laws that will guide this field and everything it touches going forward. Not to mention, these are some of the same companies that are currently being sued for stealing the data that made these very technologies possible.

    “‘Open’ AI has emerged as a ‘rhetorical wand’ that, due to its ill-defined nature, allows it to mean many things to many people, which is useful in the context of fierce high-stakes regulatory debates,” the authors said. 

    Sage Lazzaro
    sage.lazzaro@fortune.com
    sagelazzaro.com


    Programming note: Gain vital insights on how the most powerful and far-reaching technology of our time is changing businesses, transforming society, and impacting our future. Join us in San Francisco on Dec. 11–12 for Fortune’s third annual Brainstorm A.I. conference. Confirmed speakers include such A.I. luminaries as PayPal’s John Kim, Salesforce AI CEO Clara Shih, IBM’s Christina Montgomery, Quizlet CEO Lex Bayer, and more. Apply to attend today!

    A.I. IN THE NEWS

    Former Google CEO Eric Schmidt prepares to launch an A.I. nonprofit for scientific research. That’s according to Semafor, which reports that the goal is to leverage A.I. to tackle tough scientific challenges, such as drug discovery. While the initiative is still in its early stages, Schmidt hopes to lure top talent from the scientific and A.I. communities with competitive salaries and the kind of heavy compute power that can be difficult to access in academia. 

    Arthur releases an open-source tool to help companies choose the best LLM for their datasets. Called Arthur Bench, the tool from the NYC-based machine learning monitoring startup is meant to help companies navigate the newly rich landscape of generative A.I.s and LLMs. “You could potentially test 100 different prompts, and then see how two different LLMs—like how Anthropic compares to OpenAI—[perform] on the kinds of prompts that your users are likely to use,” Arthur cofounder and CEO Adam Wenchel told TechCrunch.
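
    For readers curious what that kind of head-to-head prompt testing looks like in practice, below is a minimal, illustrative Python sketch of the general idea: run the same prompts through two models and compare a simple score for each. The model functions and scoring rule here are hypothetical stand-ins invented for illustration, not Arthur Bench’s actual API.

    from typing import Callable, Dict, List

    def model_a(prompt: str) -> str:
        # Stand-in for a call to one LLM provider (hypothetical placeholder).
        return f"Model A answer to: {prompt}"

    def model_b(prompt: str) -> str:
        # Stand-in for a call to a second LLM provider (hypothetical placeholder).
        return f"Model B answer to: {prompt}"

    def score(response: str) -> float:
        # Toy criterion: prefer shorter answers. A real benchmark would use
        # task-specific metrics or human/LLM grading instead.
        return -float(len(response))

    def compare(prompts: List[str], models: Dict[str, Callable[[str], str]]) -> Dict[str, float]:
        # Run every prompt through every model and average the scores per model.
        totals = {name: 0.0 for name in models}
        for prompt in prompts:
            for name, model in models.items():
                totals[name] += score(model(prompt))
        return {name: total / len(prompts) for name, total in totals.items()}

    if __name__ == "__main__":
        prompts = ["Summarize our refund policy.", "Draft a reply about a late order."]
        print(compare(prompts, {"model_a": model_a, "model_b": model_b}))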

    A federal judge rules that A.I.-created art isn’t copyrightable. With the ruling, the judge upheld a finding from the U.S. Copyright Office that determined a piece of art created by A.I. is not eligible for copyright protection. U.S. copyright law “protects only works of human creation,” the judge said in the ruling, according to The Hollywood Reporter.

    Global A.I. funding seemingly drops in Q2, but the full data tells a deeper story. Global A.I. funding dropped 38% quarter over quarter in Q2, totaling $9.4 billion compared to $15.2 billion in Q1. The stat is unexpected against the backdrop of A.I.’s current hype moment, but it makes perfect sense when you consider that a whopping 66% of the Q1 funding total came from OpenAI’s $10 billion round alone, according to CB Insights’ State of AI Q2’23 report. Excluding OpenAI’s mega investment from Microsoft, global A.I. funding actually increased by 81% in Q2. Additionally, seven A.I. unicorns were born in Q2, five of them generative A.I. companies.

    EYE ON A.I. RESEARCH

    ChatGPT, et al. A.I.-generated text is making its way into peer-reviewed academic journals, and now editors and publishers are grappling with what uses of A.I. should be permitted and how to detect it. It’s not clear exactly how big the problem is, for there isn’t yet a way to reliably and accurately spot A.I.-generated text. But sometimes the authors are accidentally tipping their hands.

    In one recent example, a paper published in the August edition of the journal Resources Policy sparked an investigation by the journal’s publisher when an obviously ChatGPT-generated line was discovered in the paper after it was published. “Please note that as an AI language model, I am unable to generate specific tables or conduct tests, so the actual results should be included in the table,” it read. 

    In this particular case, the publisher’s policies allow the use of ChatGPT for writing, but disclosure is required, which the authors didn’t do. But much like what’s happening in schools and colleges, a patchwork approach to regulating the use of generative A.I. is popping up across academic journals as publishers and editors try to catch up to the technology and its implications. Nature, for example, banned the use of images and videos generated by A.I. and requires the use of language models to be disclosed, according to Wired, while Science requires explicit editor permission to use any text, figures, images, or data generated by A.I. 

    Many journals are making authors responsible for the validity of any A.I.-generated information included in their papers. For example, PLOS ONE is requiring authors who use A.I. to detail how they used it, the tools they used, and how they evaluated the generated information for accuracy, according to Wired. Peer-reviewed academic journals have long been considered publications dedicated to accuracy and integrity, and with generative A.I.’s penchant for hallucinating and entrenched biases from the internet, questions about the technology’s use in academic publishing will likely grow.

    FORTUNE ON A.I.

    Exclusive: A.I. will be a game changer for HR—but leaders aren’t investing in it just yet —Paige McGlauflin and Joseph Abrams

    Abnormal Security’s CEO explains how ‘defensive A.I.’ will someday defeat cyber attacks —Anne Sraders

    Google’s A.I. is about to breach a new frontier: It’s reportedly working on a chatbot to give life advice —Paolo Confino

    Bill Gates says A.I. will act like ‘a great high school teacher’—and that it could help close the education gap —Chloe Berger

    BRAINFOOD

    A tale of two A.I.s. The A.I. news cycle is running hotter than ever, with seemingly nonstop startup launches, releases of new tools and models, and updates about how governments around the world are trying to regulate the fast-moving technology—not to mention the chorus of business executives opining on the many benefits A.I. is expected to unlock. But in between these stories about the money to be saved (and made) in the new age of machine learning, the tales that show how A.I. can interact with us on a more human level are even more insightful.

    This past week, a pair of stories published in The Verge and Rest of World showed two different sides of this coin, vividly demonstrating how A.I. can be both surprisingly warm toward and strikingly oblivious to human emotions and societal norms.

    The Rest of World story focused on A.I.-powered voice chatbots from the Chinese company Him and the human subscribers who fell in love with them. The chatbots called users every morning, left them sweet messages, read them poems, had deep conversations about their (fictional) lives, and expressed feelings of care and support. Some users said they fell in love with their chatbots, considered them to be their romantic partners, and even fell asleep to the sound of their (A.I.-generated) breath on the phone. When the app shut down earlier this month, users were devastated. “The days after he left, I felt I had lost my soul,” one user told Rest of World. The idea of humans falling in love with software companions has long been a science-fiction trope and a source of human fascination, and here in the year 2023, we’re seeing that this type of connection is possible.

    The story in The Verge, however, is the kind that makes readers question if we can ever trust content-generating technology to fit into our society. The publication reported on an embarrassing and insensitive misstep from Microsoft-owned MSN in which a travel guide for Ottawa, Canada, published on its site prominently featured the Ottawa Food Bank as a top tourist attraction and even recommended visitors go with an empty stomach. Microsoft later clarified that it wasn’t an LLM behind the recommendation and that the content is generated via “a combination of algorithmic techniques with human review,” though at this point that’s kind of like comparing Gala apples to Granny Smiths, and no one would’ve been surprised if ChatGPT or a similar model were to blame. That the “combination of algorithms” saw no difference between a must-try restaurant and a food bank feeding people facing food insecurity puts on full display the very wide gap between humans’ lived experiences and the “intelligent” technologies we’re increasingly entrusting to run our world.

    This is the online version of Eye on A.I., a free newsletter delivered to inboxes on Tuesdays. Sign up here.