Hello and welcome to Eye on AI. In this edition…what’s wrong with the way we regulate medical AI and how one startup plans to fix it; OpenAI rolls out its o1 reasoning model and says its structure will change; and can AI chatbots implant false memories?
AI is poised to have a huge impact on medicine. As I write in my book, Mastering AI: A Survival Guide to Our Superpowered Future, it’s one of the areas where I am most optimistic about the technology’s likely effects.
But to reap these benefits, we should be careful about how we design AI medical software, how we use it, and how we regulate it.
Bad (AI) medicine is not what we need
As with all AI applications, the risks stem from bad or biased data. Medical research has historically suffered from the underrepresentation of women and people of color in studies. Training AI on this data can lead to models that don’t work well for these patients.
Computer vision systems that analyze medical imagery can easily suffer from “overfitting”—learning to perform well on a particular test data set, but doing so in a way that is not clinically relevant and won’t hold up in the real world. Famously, one AI model designed to identify serious cases of pneumonia in chest X-rays learned to place a great deal of emphasis on letters found in the margins of the X-ray film that indicated whether the image had been taken by a portable chest X-ray or a standard one. Portable chest X-rays are, of course, used on the sickest patients. So the AI had learned that the presence of the letter “P”—for portable—was the best predictor of bad pneumonia cases, rather than learning much about the appearance of patients’ lungs. On images without such markings, the AI was useless.
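This kind of shortcut learning is often easy to catch with a simple sanity check: hide the part of the image carrying the spurious cue and see whether performance collapses. Here is a minimal sketch of that idea in Python; the `model`, `images`, and `labels` are hypothetical stand-ins, not the actual pneumonia system described above.

```python
import numpy as np

def occlude_margins(images: np.ndarray, border: int = 32) -> np.ndarray:
    """Zero out a border around each grayscale image (shape N x H x W),
    hiding any text or markers burned into the edges of the film."""
    masked = images.copy()
    masked[:, :border, :] = 0   # top rows
    masked[:, -border:, :] = 0  # bottom rows
    masked[:, :, :border] = 0   # left columns
    masked[:, :, -border:] = 0  # right columns
    return masked

def shortcut_check(model, images, labels, border: int = 32):
    """Compare accuracy on the original images vs. margin-occluded copies.
    A large drop suggests the model leans on margin artifacts (like a
    portable-X-ray marker) rather than on the lungs themselves."""
    acc_full = float((model(images) == labels).mean())
    acc_masked = float((model(occlude_margins(images, border)) == labels).mean())
    return acc_full, acc_masked
```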
Headline accuracy figures can also be misleading. An AI that correctly identifies 95% of pathologies on chest X-rays sounds great—except if it happens to miss a particularly aggressive type of lung tumor most of the time. False positives matter, too. At best, they annoy doctors, making them less likely to use the software. At worst, they could lead to incorrect diagnoses.
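To see how the arithmetic can mislead, here is a toy example with invented numbers: a model that flags 950 of 1,000 pathologies overall, yet catches only 5 of the 50 aggressive tumors among them.

```python
# Invented numbers, for illustration only.
common_total, common_caught = 950, 945  # common pathologies the model handles well
rare_total, rare_caught = 50, 5         # a rare, aggressive tumor it mostly misses

overall_hit_rate = (common_caught + rare_caught) / (common_total + rare_total)
rare_sensitivity = rare_caught / rare_total

print(f"Headline figure: {overall_hit_rate:.1%} of pathologies identified")  # 95.0%
print(f"Sensitivity on the aggressive tumor: {rare_sensitivity:.0%}")         # 10%
```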
Luckily, you might think, we have medical regulatory bodies to guard us against these dangers, while also ensuring important medical AI innovations reach patients quickly. You’d be wrong. Our current regulatory procedures are poorly suited for AI.
Lack of clinical validation
In the U.S., the Food and Drug Administration has approved close to 1,000 AI-enabled “medical devices” (which can include software as well as hardware with AI features). The vast majority of these (97%) have been approved through a process known as 510(k) that allows for faster approvals so long as the software vendor shows that their product is “substantially equivalent” to a previously approved device.
But the state of the art in AI changes rapidly, making it difficult to say that the performance of a new AI model is equivalent to that of older forms of software.
More importantly, vendors are allowed to test their AI software on historical data. They don’t need to prove it improves patient outcomes in real-world clinical settings. In a recent paper published in Nature Medicine, researchers found that 43% of FDA-approved AI medical devices lacked any clinical validation data. And of the 521 AI devices the researchers examined, only 22 had submitted results from randomized controlled trials, the gold standard for validating therapies.
The FDA’s rules were designed for hardware, which is generally upgraded infrequently. They never anticipated a world of Agile software development, with weekly app updates. The FDA has introduced “Predetermined Change Control Plans” (PCCP) to allow minor software updates on a preset schedule, but this still doesn’t fully address the needs of AI models, some of which can learn continuously.
One U.K. startup thinks there is a better way
In the U.K. and Europe, the situation is more flexible, but still has drawbacks. Here, government medical regulators outsource the authorization of medical devices to designated private companies called “notified bodies.”
I recently met with Scarlet, a startup that is a notified body for both the U.K. and the EU, specializing in AI medical software. It’s creating a technology platform that makes it much easier for AI vendors to submit their market authorization applications for review.
James Dewar, Scarlet’s cofounder and CEO, tells me that the company’s technology helps standardize submission documentation and automatically checks if a vendor’s application is complete, saving days to weeks of time. Most importantly, software developers can submit updates to their software as frequently as they wish, and get approvals for these software updates in days, instead of the six to eight months it could take in the past.
Dewar and his cofounder, Jamie Cox, both previously worked on medical AI models at former U.K. health tech company Babylon Health (later bought by eMed Healthcare). But Scarlet’s platform doesn’t use AI itself—at least not yet, although Dewar says the company is considering how large language models might help. Human experts review the substance of each application, something that is unlikely to change, he said.
Buyer beware
More troublingly, Dewar told me that there are no explicit requirements for notified bodies to examine how well a product performs for patient subgroups or disease subtypes—or how they should deal with AI concepts such as bias and model drift.
Vendors are not required, for instance, to submit confusion matrices: tables that break down performance, on metrics such as false positive and false negative rates, across different patient groups. Scarlet does, however, currently ask vendors to submit these metrics.
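For readers who haven’t run into the term, here is a minimal sketch, using made-up labels and patient groups rather than any real submission, of the per-subgroup error rates a confusion matrix lets a reviewer inspect.

```python
import numpy as np

def rates_by_group(y_true, y_pred, groups):
    """Return {group: (false_positive_rate, false_negative_rate)}."""
    out = {}
    for g in np.unique(groups):
        m = groups == g
        t, p = y_true[m], y_pred[m]
        fp = np.sum((p == 1) & (t == 0))  # healthy patients flagged as sick
        fn = np.sum((p == 0) & (t == 1))  # sick patients the model missed
        tn = np.sum((p == 0) & (t == 0))
        tp = np.sum((p == 1) & (t == 1))
        out[str(g)] = (float(fp / max(fp + tn, 1)), float(fn / max(fn + tp, 1)))
    return out

# Made-up example: both groups have the same overall accuracy (3 of 4 correct),
# but group A suffers missed diagnoses while group B gets false alarms.
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
print(rates_by_group(y_true, y_pred, groups))
# {'A': (0.0, 0.5), 'B': (0.5, 0.0)}
```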
“There’s an element of buyer beware,” Dewar says of AI medical devices. “At the moment, the regulation doesn’t ask us to do anything about bias. We would welcome those changes, but that is not what the current regulations specify.” He also said there was “a balance” to be struck between increasing requirements around clinical effectiveness and the need to “get innovation to market.”
A model for the EU AI Act?
Scarlet just received a $17.5 million Series A venture capital investment from London-based venture capital firm Atomico, with participation from prior investors Kindred Capital, Creandum, and EF (Entrepreneur First). The company is hoping to expand into the U.S., where the FDA uses accredited private organizations to conduct initial reviews of 510(k) applications—although unlike in Europe, in the U.S. these private companies do not have the final say on authorization.
Dewar said Scarlet was also considering branching out into certification of AI software in other high-risk settings besides medicine. Under the EU AI Act, any company deploying AI software in high-risk areas such as controlling electricity or water supplies, or grading university entrance exams, must have an outside party verify its risk-assessment and mitigation processes. A big question has been: Which organizations will have the expertise to conduct these checks? Well, Scarlet might be one.
And with that, here’s more AI news.
Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn
Correction, Sept. 18: An earlier version of this story incorrectly identified the company where Dewar and Cox worked prior to founding Scarlet. It was Babylon Health, not Benevolent AI. The story has also been updated to clarify that the months-to-days speed advantage Scarlet’s platform provides is for updates to AI software that has been previously approved, not for initial authorization applications, and to clarify that Scarlet currently does ask its customers to submit confusion matrices even though there is no legal requirement that they do so.
Before we get to the news. If you want to learn more about AI and its likely impacts on our companies, our jobs, our society, and even our own personal lives, please consider picking up a copy of my book, Mastering AI: A Survival Guide to Our Superpowered Future. It’s out now in the U.S. from Simon & Schuster, and you can order a copy today here. In the U.K. and Commonwealth countries, you can buy the British edition from Bedford Square Publishers here.
AI IN THE NEWS
OpenAI unveiled o1, a model that can handle difficult math and reasoning tasks. The model is trained to explore different possible logical pathways to answering a prompt before selecting the one most likely to be correct. It can solve math, logic, and wordplay puzzles that stumped previous models. But it also takes longer to answer—sometimes as long as half a minute. It also uses more computing power and is significantly more expensive for enterprise customers to use. OpenAI has given its premium users access to a preview version of o1 that is not quite as powerful as the full model, which is still under development, as well as an o1-mini version fine-tuned for math and coding questions. According to OpenAI’s own assessments, o1 presents some dangers, including a “medium risk” of helping someone pull off a biological attack. You can read more of my coverage of o1 here.
OpenAI tells staff that it plans to alter its structure. CEO Sam Altman told staff that the company plans to change its unusual and complicated corporate structure, likely next year, and move away from having its non-profit board control its activities, according to a scoop from my Fortune colleague Kali Hays. The changes are seen as necessary to help attract further investment into OpenAI, which is reportedly in talks to raise as much as $6.5 billion at a valuation of $150 billion. An OpenAI spokesperson said that the non-profit entity will continue to exist—but didn't say whether the non-profit will continue to have the ultimate say over what happens to the technology OpenAI develops.
Intel wins AWS chipmaking deal, agrees to spin off its foundry business. Intel signed up Amazon’s AWS cloud service to produce a new AI chip using its new 18A chipmaking process in a major win for the chipmaker. The company is betting on 18A to position its foundry business—which will make chips for both Intel and other companies—as a credible alternative to TSMC, which dominates the business of contract manufacturing of high-end chips. Intel’s board also agreed to a plan that will see its foundry business spun off as a separate subsidiary from Intel’s now “fabless” chip design division. This will allow the foundry business to attract outside investment and offer Intel foundry customers assurances that their design innovations won’t leak to Intel’s own chip offerings. You can read more from Bloomberg News here.
Microsoft rolls out AI agents and new AI features in Copilot. The company announced the wide release of AI agents that let users automate tasks across its 365 Copilot apps and software from numerous other vendors. It also introduced Pages, a new file type that serves as a workspace where both humans and AI agents can share documents and collaborate. And it unveiled new generative AI features for Excel, PowerPoint, and Outlook. The launches put Microsoft toward the front of the pack in AI agents, which many technologists see as the next big tech platform, and aim to counter grumbling from some chief information officers that Microsoft’s Copilot products don’t produce enough value to justify their cost. You can read more of my coverage of the Copilot news here.
Groq scores big with Saudi Aramco data center deal. AI chip startup Groq won a major contract to equip a large data center being built in Saudi Arabia by the country’s oil giant Aramco, Bloomberg News reported. The data center is part of the Saudi government’s effort to diversify its economy from oil and establish the Kingdom as a major hub for high-tech innovation. Groq has established a reputation for chips that can run already-trained AI models at speeds faster than typical graphics processing units (GPUs), such as those sold by Nvidia.
EYE ON AI RESEARCH
New Google DeepMind model could directly improve drug discovery. Google DeepMind has already given a massive boost to biological research with its AlphaFold models, which can predict the structure of a protein from its amino acid sequence. This is already helping researchers, including those at sister Alphabet company Isomorphic Labs, speed up the process of discovering possible new medicines. But now Google DeepMind has gone a step further, unveiling AlphaProteo, a new AI model that can design proteins that bind to a chosen target molecule. AlphaProteo is likely to have a direct impact on drug design, enabling researchers to more easily create new protein-based therapies, and possibly letting them better study why certain small molecules have various effects. You can read more about AlphaProteo in DeepMind’s blog here.
FORTUNE ON AI
Larry Ellison and Elon Musk ‘begged’ Nvidia’s Jensen Huang for more GPUs over a fancy sushi dinner—by Amanda Gerut
Book excerpt: How Elon Musk, Sam Altman, and the Silicon Valley elite manipulate the public—by Gary Marcus
Spanish socialist Teresa Ribera likely to become EU antitrust enforcer, as Ursula von der Leyen names new team—by David Meyer
AI might actually help us find a greater sense of purpose at work—by Hillary Hoffower
AI CALENDAR
Sept. 17-19: Dreamforce, San Francisco
Sept. 25-26: Meta Connect, Menlo Park, Calif.
Oct. 22-23: TedAI, San Francisco
Oct. 28-30: Voice & AI, Arlington, Va.
Nov. 19-22: Microsoft Ignite, Chicago
Dec. 2-6: AWS re:Invent, Las Vegas
Dec. 8-12: Neural Information Processing Systems (NeurIPS) 2024, Vancouver, British Columbia
Dec. 9-10: Fortune Brainstorm AI, San Francisco (register here)
BRAIN FOOD
Could LLM-based AI chatbots be used to implant false memories? That was the question behind a study conducted by researchers at MIT and the University of California, Irvine, in which test subjects watched security camera footage of a robbery and then had a conversation with an AI chatbot that asked them questions about what they had just seen. Except, in this case, the chatbots had been previously prompted by the researchers to ask leading questions designed to convince the “witnesses” that they had seen things that had not actually occurred in the video.
It is well known that eyewitness testimony—particularly of traumatic events—is horribly unreliable and easily shaped by the opinions of others or the way in which we are asked to recall events.
But the LLM-based chatbot proved far more potent at implanting false memories than the other methods the researchers tested, such as old-fashioned surveys with leading questions or conversations with a pre-scripted chatbot.
It seems the generative AI chatbot’s ability to tailor each question to a test subject’s previous answers gave it particular power. Every day we are learning more about how persuasive AI chatbots can be, and how they could be used to shape public perceptions, for good and ill. You can read the research paper on arxiv.org here.