A startup tested if ChatGPT and other AI chatbots could understand SEC filings. They failed about 70% of the time and only succeeded if told exactly where to look

By Paolo Confino, Reporter
December 21, 2023, 7:00 AM ET
Sam Altman, chief executive of OpenAI. Dustin Chambers—Bloomberg/Getty Images

Wall Street financial types who hoped that AI would soon take over the tedious job of sifting through page after page of earnings reports might have to wait a little longer. It turns out AI models from some of the most prominent tech companies, including OpenAI, Meta, and Anthropic, are exceptionally bad at analyzing SEC filings and other financial documents. 

The poor results mean that many tasks in the financial services industry that require lots of math and research will still have to be done by humans. Although financial companies have started to explore AI tools in recent months, research from startup Patronus AI published in November demonstrates the models may still be unable to perform more complex tasks without human oversight.

For its study, Patronus AI, cofounded by two ex-Meta employees, fed large language models from OpenAI, Meta, and Anthropic varying amounts of data from SEC documents and then asked the AI to respond to 150 questions. It found that the models either were unable to respond to the prompts they were given or spit out incorrect answers, known as “hallucinations” in AI lingo. For Patronus CEO Anand Kannappan, this outcome was largely expected.

“After speaking with folks in the financial services industry over several months who experienced this hands-on, we suspected something like this might happen,” he tells Fortune. “We heard about how analysts are spending lots of time crafting prompts and manually inspecting outputs, only to discover hallucinations often.” 

When companies do run tests meant to limit hallucinations, or any other issues, the tests are sometimes done haphazardly, he adds. That’s partly because refining the model takes time, is expensive, and requires substantial expertise, either in-house or from outside help—further raising the costs. “A lot of companies approached it like a vibe check,” Kannappan says. “Just testing by inspection—people just cross their fingers and hope everything’s going to be okay.”

Best practices for AI testing are still a work in progress, like AI itself. “I don’t know that there are any,” says Vasant Dhar, a professor at NYU’s Stern School of Business, who also founded SCT Capital Management, a hedge fund that uses machine learning to make investment decisions. “People haven’t really spent the time, or the effort or exercised the discipline involved in conducting proper scientific evaluations of this methodology.” 

Some financial companies have joined the AI race themselves, launching a variety of projects, including AI tools that help financial advisors with some of their day-to-day duties. But those tools aren’t tailor-made for deciphering SEC documents of the kind Patronus tested. For example, Bloomberg published research about a large language model that specializes in financial analysis, with much of its data coming from Bloomberg’s data operations. Meanwhile, BNY Mellon opened an AI research center in Dublin, Ireland. Others like JPMorgan and Goldman Sachs are developing their own generative AI tools in-house. Despite the advances from these deep-pocketed companies, doubts remain over the security and accuracy of the financial information they can produce. For financial institutions, the need to safeguard customer information is critical, even existential, but many AI models aren’t designed specifically for combing through financial documents.

“The biggest problem is that a lot of these general purpose models weren’t trained on information like this,” Kannappan says.

Even when the models from OpenAI, Meta, and Anthropic did provide answers, they were often wrong. Patronus ran different kinds of tests, including one built around a retrieval system, in which the researchers would ask the AI chatbots a question about SEC filings and other financial documents stored within a database. Results for this version proved to be poor.
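
The article doesn’t detail Patronus’ exact test harness, but the general shape of such a retrieval setup is standard: chunks of the filings are embedded into vectors, stored in a database, and the chunks most similar to a question are handed to the chatbot as context. Below is a minimal sketch with a placeholder embed() function standing in for a real embedding model; with random vectors it won’t retrieve anything meaningful, it only shows the shape of the pipeline.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: a real system would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def build_index(chunks: list[str]) -> tuple[np.ndarray, list[str]]:
    """Embed every chunk of every filing into one shared index."""
    vectors = np.stack([embed(c) for c in chunks])
    return vectors, chunks

def retrieve(question: str, vectors: np.ndarray, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the question (cosine similarity)."""
    q = embed(question)
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

# The retrieved chunks would then be pasted into the chatbot's prompt
# alongside the question, and the model asked to answer from them.
```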

In one setup, Meta’s Llama2, an open source model released earlier this year, answered 104 questions incorrectly, meaning it was wrong about 70% of the time. It failed to provide any answer to the question 11% of the time. Llama2 only got the answer right 19% of the time, or just 29 out of 150 questions.

OpenAI’s GPT-4-Turbo fared only slightly better. It got the same number of correct answers, 29 out of 150, but gave fewer incorrect ones: just 20, or 13% of the time. However, it failed to provide any answer to 101 of the 150 questions, or 68% of the time. An accompanying research paper that Kannappan coauthored theorizes this might actually be a preferred outcome. “Models refusing to answer is arguably preferable to giving an incorrect answer as it creates less risk of error, and misplaced trust, by users,” the paper reads.
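
For reference, the raw counts behind those figures tally up as follows; Llama2’s refusal count is inferred from the other two numbers, and small differences from the quoted percentages come down to rounding.

```python
# Counts reported for the shared-database retrieval test, out of 150 questions.
# Llama2's "refused" count is inferred as 150 - 104 - 29.
results = {
    "Llama2":      {"correct": 29, "incorrect": 104, "refused": 17},
    "GPT-4-Turbo": {"correct": 29, "incorrect": 20,  "refused": 101},
}
for model, counts in results.items():
    total = sum(counts.values())  # 150 in both cases
    shares = {outcome: f"{n / total:.0%}" for outcome, n in counts.items()}
    print(model, shares)
```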

When reached for comment, OpenAI referred Fortune to its usage policies, which specify its models are unable to provide financial advice “without a qualified person reviewing the information.” Any financial institution that uses them that way would need to offer a disclaimer that it is doing so. Meta and Anthropic did not respond to a request for comment.

In another trial of the experiment, Patronus set up a different configuration of the retrieval system. This time the researchers set up a separate database for each file, rather than one database containing many files, as is typically the case in practice. Here the models fared significantly better, but that setup wouldn’t hold up in a real-world scenario because it would require companies to create many databases containing just one file each. It “is just impractical because if you’re a company that has 10,000 documents, you wouldn’t have 10,000 vector databases,” Kannappan says.
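
In code terms, the difference between the two configurations is roughly the following sketch, using toy data and reusing the hypothetical build_index() helper from the earlier example.

```python
# Toy filings, each split into text chunks.
filings = {
    "companyA_10-K": ["Revenue was $1.2 billion in fiscal 2022.", "Risk factors include ..."],
    "companyB_10-Q": ["Net income rose 8% from the prior quarter.", "Liquidity remains strong ..."],
}

# Realistic setup: one shared index over the chunks of every filing.
shared_index = build_index([chunk for chunks in filings.values() for chunk in chunks])

# The better-scoring but impractical variant: a separate index per document,
# which presumes you already know which filing holds the answer.
per_file_indexes = {name: build_index(chunks) for name, chunks in filings.items()}
```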

The AI tools all performed better when they were directly fed the entire documents needed, known as “long context.” The researchers loaded the documents into the chatbot windows, where users usually type in queries. In this trial, GPT-4-Turbo got 79% of the 150 questions right, for a total of 118 correct answers, and 26 wrong answers—more than it did without the “long context.” It only failed to respond 4% of the time. In the same test Anthropic’s Claude2 performed slightly worse, with correct answers 76% of the time and 32 wrong answers out of 150 questions. 
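
A “long context” call looks different: there is no database at all, just the full filing text pasted into the prompt alongside the question. Here is a minimal sketch using the OpenAI Python SDK; the model name, prompt wording, and the answer_from_filing() helper are illustrative assumptions, not Patronus’ exact setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_from_filing(filing_text: str, question: str) -> str:
    """Ask the model a question with the whole document pasted into the prompt."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # assumed model alias; use whichever long-context model you have access to
        messages=[
            {"role": "system", "content": "Answer only from the filing provided. "
                                          "If the answer is not in the filing, say so."},
            {"role": "user", "content": f"Filing:\n{filing_text}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```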

While the added documents can help users make sure the chatbot is referencing the correct information, that only partially translates into increased accuracy. GPT-4-Turbo’s results show a clear improvement in the number of right answers and fewer non-answers, but seemingly at the expense of more wrong answers. That tradeoff will likely make financial institutions wary.

“In an industry setting, the high proportion of failures which are incorrect answers rather than refusals could be still concerning as it indicates a greater risk of hallucinations,” the research paper Kannappan coauthored says. 

Kannappan adds there’s another, more logistical problem with this iteration of the experiment: Some documents are too long to be uploaded entirely and feature so many different graphs and tables, in addition to all the financial jargon, that they can confuse the AI model. It’s a major problem considering financial companies want to use AI on an ever-growing number of financial documents, like deal memos, pitch decks, or equity research, not just the publicly available SEC filings used in Patronus’ research.
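
One way to see the length problem concretely is to count tokens before deciding whether a document can be pasted into the prompt at all. A quick sketch using OpenAI’s tiktoken tokenizer; the 128,000-token window is an assumption for a long-context GPT-4-class model, and the short string stands in for a full 10-K.

```python
import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4")
filing_text = "Item 1. Business. The company operates ..."  # in practice, the full 10-K text
num_tokens = len(encoder.encode(filing_text))
print(f"{num_tokens} tokens; fits in a 128k context window: {num_tokens <= 128_000}")
```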

To round out its research, Patronus tested a highly unlikely scenario to see how the model would react. In this setup ChatGPT was fed not just the “long context” of an entire SEC filing, but the specific section it would need to find the correct answer to a certain question. Predictably, this version performed the best, with an 85% success rate, answering 128 out of the 150 questions correctly. This too is impractical because it would require the person using the AI to already have located the correct information within a document that could be hundreds of pages long. Doing so essentially defeats the purpose of using the AI, which analysts hope will be able to do research for them. “The whole point of these machines is that they make intelligent decisions, which means you can trust them,” Dhar says.
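
In code, that final configuration is just the long-context call from above with a much smaller input, because a human has already done the retrieval. A brief sketch reusing the hypothetical answer_from_filing() helper; the section text and question are purely illustrative.

```python
# "Best case" setup: only the hand-located section is passed in,
# which is exactly the manual step that makes this configuration impractical.
relevant_section = "Item 7. Management's Discussion and Analysis of Financial Condition ..."
print(answer_from_filing(relevant_section, "How did operating expenses change year over year?"))
```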

For Dhar, the question of automating financial analysis is about trust in the technology as much as it is about the capabilities of the technology itself. “The real question here is: can you really trust a machine with performing a task you’re giving it, meaning that you’re willing to accept its errors and their consequences just like you do with humans?”
