A startup tested if ChatGPT and other AI chatbots could understand SEC filings. They failed about 70% of the time and only succeeded if told exactly where to look

By Paolo Confino, Reporter
December 21, 2023, 7:00 AM ET
Sam Altman, chief executive of OpenAI. Dustin Chambers—Bloomberg/Getty Images

Wall Street financial types who hoped that AI would soon take over the tedious job of sifting through page after page of earnings reports might have to wait a little longer. It turns out AI models from some of the most prominent tech companies, including OpenAI, Meta, and Anthropic, are exceptionally bad at analyzing SEC filings and other financial documents. 

The poor results mean that many tasks in the financial services industry that require lots of math and research will still have to be done by humans. Although financial companies have started to explore AI tools in recent months, research from startup Patronus AI published in November demonstrates the models may still be unable to perform more complex tasks without human oversight.

For its study, Patronus AI, cofounded by two ex-Meta employees, fed large language models from OpenAI, Meta, and Anthropic varying amounts of data from SEC documents and then asked the AI to respond to 150 questions. It found that the models either were unable to respond to the prompts they were given or spit out incorrect answers, known as “hallucinations” in AI lingo. For Patronus CEO Anand Kannappan, this outcome was largely expected.

“After speaking with folks in the financial services industry over several months who experienced this hands-on, we suspected something like this might happen,” he tells Fortune. “We heard about how analysts are spending lots of time crafting prompts and manually inspecting outputs, only to discover hallucinations often.” 

When companies do run tests meant to limit hallucinations, or any other issues, the tests are sometimes done haphazardly, he adds. That’s partly because refining the model takes time, is expensive, and requires substantial expertise, either in-house or from outside help—further raising the costs. “A lot of companies approached it like a vibe check,” Kannappan says. “Just testing by inspection—people just cross their fingers and hope everything’s going to be okay.”

Best practices for AI testing are still a work in progress, like AI itself. “I don’t know that there are any,” says Vasant Dhar, a professor at NYU’s Stern School of Business, who also founded SCT Capital Management, a hedge fund that uses machine learning to make investment decisions. “People haven’t really spent the time, or the effort or exercised the discipline involved in conducting proper scientific evaluations of this methodology.” 

Some financial companies have joined the AI race themselves, launching a variety of projects, including AI tools that help financial advisors with some of their day-to-day duties. But those tools aren’t tailor-made for deciphering SEC documents of the kind Patronus tested. For example, Bloomberg published research about a large language model that specializes in financial analysis, with much of its data coming from Bloomberg’s data operations. Meanwhile, BNY Mellon opened an AI research center in Dublin, Ireland. Others like JPMorgan and Goldman Sachs are developing their own generative AI tools in-house. Despite the advances from these deep-pocketed companies, doubts remain over the security and accuracy of the financial information they can produce. For financial institutions, the need to safeguard customer information is critical, even existential, but many AI models aren’t designed specifically for combing through financial documents.

“The biggest problem is that a lot of these general purpose models weren’t trained on information like this,” Kannappan says.

Even when the models from OpenAI, Meta, and Anthropic did provide answers, they were often wrong. Patronus ran different kinds of tests, including one built around a retrieval system, in which the researchers would ask the AI chatbots a question about SEC filings and other financial documents stored within a database. Results for this version proved to be poor.
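
The article doesn’t detail Patronus’ exact test harness, but the general shape of such a retrieval setup is standard: chunks of the filings are embedded into vectors, stored in a database, and the chunks most similar to a question are handed to the chatbot as context. Below is a minimal sketch with a placeholder embed() function standing in for a real embedding model; with random vectors it won’t retrieve anything meaningful, it only shows the shape of the pipeline.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: a real system would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def build_index(chunks: list[str]) -> tuple[np.ndarray, list[str]]:
    """Embed every chunk of every filing into one shared index."""
    vectors = np.stack([embed(c) for c in chunks])
    return vectors, chunks

def retrieve(question: str, vectors: np.ndarray, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the question (cosine similarity)."""
    q = embed(question)
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

# The retrieved chunks would then be pasted into the chatbot's prompt
# alongside the question, and the model asked to answer from them.
```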

In one setup, Meta’s Llama2, an open source model released earlier this year, answered 104 questions incorrectly, meaning it was wrong about 70% of the time. It failed to provide any answer to the question 11% of the time. Llama2 only got the answer right 19% of the time, or just 29 out of 150 questions.

OpenAI’s GPT-4-Turbo fared only slightly better. It got the same number of correct answers, 29 out of 150, but gave fewer incorrect ones: just 20, or 13% of the time. However, it failed to provide any answer to 101 of the 150 questions, or 68% of the time. An accompanying research paper that Kannappan coauthored theorizes this might actually be a preferred outcome. “Models refusing to answer is arguably preferable to giving an incorrect answer as it creates less risk of error, and misplaced trust, by users,” the paper reads.
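
For reference, the raw counts behind those figures tally up as follows; Llama2’s refusal count is inferred from the other two numbers, and small differences from the quoted percentages come down to rounding.

```python
# Counts reported for the shared-database retrieval test, out of 150 questions.
# Llama2's "refused" count is inferred as 150 - 104 - 29.
results = {
    "Llama2":      {"correct": 29, "incorrect": 104, "refused": 17},
    "GPT-4-Turbo": {"correct": 29, "incorrect": 20,  "refused": 101},
}
for model, counts in results.items():
    total = sum(counts.values())  # 150 in both cases
    shares = {outcome: f"{n / total:.0%}" for outcome, n in counts.items()}
    print(model, shares)
```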

When reached for comment, OpenAI referred Fortune to its usage policies, which specify its models are unable to provide financial advice “without a qualified person reviewing the information.” Any financial institution that uses them that way would need to offer a disclaimer that it is doing so. Meta and Anthropic did not respond to a request for comment.

In another trial of the experiment, Patronus set up a different configuration of the retrieval system. This time the researchers set up a separate database for each file, rather than one database containing many files, as is typically the case in practice. Here the models fared significantly better, but that setup wouldn’t hold up in a real-world scenario because it would require companies to create many databases containing just one file each. It “is just impractical because if you’re a company that has 10,000 documents, you wouldn’t have 10,000 vector databases,” Kannappan says.
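
In code terms, the difference between the two configurations is roughly the following sketch, using toy data and reusing the hypothetical build_index() helper from the earlier example.

```python
# Toy filings, each split into text chunks.
filings = {
    "companyA_10-K": ["Revenue was $1.2 billion in fiscal 2022.", "Risk factors include ..."],
    "companyB_10-Q": ["Net income rose 8% from the prior quarter.", "Liquidity remains strong ..."],
}

# Realistic setup: one shared index over the chunks of every filing.
shared_index = build_index([chunk for chunks in filings.values() for chunk in chunks])

# The better-scoring but impractical variant: a separate index per document,
# which presumes you already know which filing holds the answer.
per_file_indexes = {name: build_index(chunks) for name, chunks in filings.items()}
```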

The AI tools all performed better when they were directly fed the entire documents needed, known as “long context.” The researchers loaded the documents into the chatbot windows, where users usually type in queries. In this trial, GPT-4-Turbo got 79% of the 150 questions right, for a total of 118 correct answers, and 26 wrong answers—more than it did without the “long context.” It only failed to respond 4% of the time. In the same test Anthropic’s Claude2 performed slightly worse, with correct answers 76% of the time and 32 wrong answers out of 150 questions. 
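
A “long context” call looks different: there is no database at all, just the full filing text pasted into the prompt alongside the question. Here is a minimal sketch using the OpenAI Python SDK; the model name, prompt wording, and the answer_from_filing() helper are illustrative assumptions, not Patronus’ exact setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_from_filing(filing_text: str, question: str) -> str:
    """Ask the model a question with the whole document pasted into the prompt."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # assumed model alias; use whichever long-context model you have access to
        messages=[
            {"role": "system", "content": "Answer only from the filing provided. "
                                          "If the answer is not in the filing, say so."},
            {"role": "user", "content": f"Filing:\n{filing_text}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```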

While the added documents can help users make sure the chatbot is referencing the correct information, that only partially translates into increased accuracy. GPT-4-Turbo’s results show a clear improvement in the number of right answers and fewer non-answers, but seemingly at the expense of more wrong answers. That tradeoff will likely make financial institutions wary.

“In an industry setting, the high proportion of failures which are incorrect answers rather than refusals could be still concerning as it indicates a greater risk of hallucinations,” the research paper Kannappan coauthored says. 

Kannappan adds there’s another, more logistical problem with this iteration of the experiment: Some documents are too long to be uploaded entirely and feature so many different graphs and tables, in addition to all the financial jargon, that they can confuse the AI model. It’s a major problem considering financial companies want to use AI on an ever-growing number of financial documents, like deal memos, pitch decks, or equity research, not just the publicly available SEC filings used in Patronus’ research.
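
One way to see the length problem concretely is to count tokens before deciding whether a document can be pasted into the prompt at all. A quick sketch using OpenAI’s tiktoken tokenizer; the 128,000-token window is an assumption for a long-context GPT-4-class model, and the short string stands in for a full 10-K.

```python
import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4")
filing_text = "Item 1. Business. The company operates ..."  # in practice, the full 10-K text
num_tokens = len(encoder.encode(filing_text))
print(f"{num_tokens} tokens; fits in a 128k context window: {num_tokens <= 128_000}")
```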

To round out its research, Patronus tested a highly unlikely scenario to see how the model would react. In this setup ChatGPT was fed not just the “long context” of an entire SEC filing, but the specific section it would need to find the correct answer to a certain question. Predictably, this version performed the best, with an 85% success rate, answering 128 out of the 150 questions correctly. This too is impractical because it would require the person using the AI to already have located the correct information within a document that could be hundreds of pages long. Doing so essentially defeats the purpose of using the AI, which analysts hope will be able to do research for them. “The whole point of these machines is that they make intelligent decisions, which means you can trust them,” Dhar says.
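
In code, that final configuration is just the long-context call from above with a much smaller input, because a human has already done the retrieval. A brief sketch reusing the hypothetical answer_from_filing() helper; the section text and question are purely illustrative.

```python
# "Best case" setup: only the hand-located section is passed in,
# which is exactly the manual step that makes this configuration impractical.
relevant_section = "Item 7. Management's Discussion and Analysis of Financial Condition ..."
print(answer_from_filing(relevant_section, "How did operating expenses change year over year?"))
```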

For Dhar, the question of automating financial analysis is about trust in the technology as much as it is about the capabilities of the technology itself. “The real question here is: can you really trust a machine with performing a task you’re giving it, meaning that you’re willing to accept its errors and their consequences just like you do with humans?”
