
How Tech Made the Pulitzer Prize-Winning Panama Papers Coverage Possible

By Barb Darrow
May 30, 2017, 1:32 PM ET

Reporters are relying more and more on troves of public—or leaked—data to do their jobs. And they also increasingly depend on technology tools to help organize and sift through all of that information.

Case in point: The Panama Papers, a massive leak of financial records from Panamanian law firm Mossack Fonseca obtained by the German newspaper Süddeutsche Zeitung and shared with the International Consortium of Investigative Journalists (ICIJ). That data led to huge scoops last year by journalists around the world that exposed a network of tax havens used by the rich and powerful in government and private industry. The stories led to the resignation of at least one head of state and embarrassed dozens of others including former U.K. prime minister David Cameron and Russian president Vladimir Putin.

But none of those stories would have appeared without a lot of work preparing the data. This was a mother lode: The Panama Papers comprised some 2.6TB of data and 11.5 million documents about Mossack Fonseca clients, many of whom, it turned out, used the law firm and its affiliates to dodge taxes. NSA whistleblower Edward Snowden, who knows a bit about these things, called it the biggest leak in data journalism history. For context, 2.6TB of data would equal the capacity of about 390 DVDs, a stack of which would be nearly 21 feet high.


The data came into the ICIJ’s possession in dribs and drabs, and in many formats. Much of it was email and PDF files of the sort that are created to be printed out and viewed. That document data is called unstructured because it does not come in the neat rows and columns of traditional databases. For this type of data, the ICIJ team used three open-source, or free, tools: Tesseract, optical character recognition (OCR) software, to read the text in scanned documents; Apache Solr to index it and make it searchable; and Apache Tika to extract data from these documents.
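To give a sense of what "index it and make it searchable" means, here is a minimal sketch of an inverted index, the basic idea behind search engines such as Solr. It is a toy illustration in plain Python, not Solr's API or the ICIJ's actual setup; the document names and text are hypothetical stand-ins for what OCR and extraction would produce.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical extracted text, keyed by a made-up document id.
documents = {
    "doc1": "Certificate of incorporation for Example Holdings Ltd",
    "doc2": "Email regarding Example Holdings bank account",
    "doc3": "Invoice for registered agent services",
}
index = build_index(documents)
print(search(index, "example holdings"))
```

Because the index is built once up front, each search becomes a handful of set lookups rather than a fresh scan of every document.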

Biggest leak in the history of data journalism just went live, and it's about corruption. https://t.co/dYNjD6eIeZ pic.twitter.com/638aIu8oSU

— Edward Snowden (@Snowden) April 3, 2016

And much of the Mossack Fonseca data came from a traditional structured row-and-column database but arrived in very raw form, not in the full database files that would normally be shared. It’s sort of like someone sending a list of letters and words instead of a fully formatted Word document: the information is there but is not all that useful.

Because of that, the ICIJ tech staff had to take the leaked information and rebuild the original SQL database structure it came from. The term SQL, or structured query language, describes how these databases are set up and how users can request information from them.
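As a rough illustration of that kind of rebuilding, the sketch below takes raw delimited text, declares a table for it, loads the rows, and then asks a question in SQL. Everything here is an assumption made for the example: the table name, the column names, the pipe delimiter and the sample rows are hypothetical, and SQLite stands in for whatever database the ICIJ actually used.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: delimited text that arrives with no schema attached.
RAW_COMPANIES = "1|Example Holdings Ltd|BVI\n2|Sample Trading SA|Panama\n"

def rebuild_table(conn, raw_text):
    """Recreate a table definition and load the raw rows into it."""
    conn.execute(
        "CREATE TABLE companies ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, jurisdiction TEXT)"
    )
    reader = csv.reader(io.StringIO(raw_text), delimiter="|")
    # Convert the id column from text to an integer before inserting.
    rows = [(int(r[0]), r[1], r[2]) for r in reader]
    conn.executemany("INSERT INTO companies VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
rebuild_table(conn, RAW_COMPANIES)

# Once the structure exists again, the data can be queried with SQL.
cursor = conn.execute(
    "SELECT name FROM companies WHERE jurisdiction = ?", ("Panama",)
)
names = [row[0] for row in cursor]
print(names)
```

The hard part in practice is working out what the original tables and columns were; the loading step itself is comparatively mechanical.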


From there, the ICIJ relied on an open-source version of Talend (TLND) software, known by techies as an “ETL” or extract, transform and load tool. Talend’s technology let the journalists take the row-and-column data structures they had painstakingly rebuilt and pump them into an open-source Neo4j graph database, which let reporters see onscreen icons representing people or organizations that are based on the original data.

Talend enabled the team to take structured data from different sources and automate the process of putting it all together. “It’s like a recipe. You create a job and get three columns of data from this source, and two from this source, intermix them in SQL form,” Mar Cabra, editor of the ICIJ’s Data & Research Unit, told Fortune. Without a tool like Talend the team would have to write a ton of software code to do that.
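The "recipe" Cabra describes can be sketched by hand in a few lines, which also hints at how much code a larger job would take without an ETL tool. This is a plain-Python illustration, not Talend's interface; the two sources, their column names and the `run_job` function are all hypothetical.

```python
# Hypothetical source 1: three columns describing companies.
companies = [
    {"company_id": 1, "name": "Example Holdings Ltd", "jurisdiction": "BVI"},
    {"company_id": 2, "name": "Sample Trading SA", "jurisdiction": "Panama"},
]

# Hypothetical source 2: two columns linking officers to companies.
officers = [
    {"officer": "A. Person", "company_id": 1},
    {"officer": "A. Person", "company_id": 2},
    {"officer": "B. Person", "company_id": 2},
]

def run_job(companies, officers):
    """Extract both sources, join them on company_id, and return combined rows."""
    by_id = {c["company_id"]: c for c in companies}
    combined = []
    for link in officers:
        company = by_id.get(link["company_id"])
        if company is None:
            continue  # skip links that point at an unknown company
        combined.append({
            "officer": link["officer"].strip(),
            "company": company["name"].strip(),
            "jurisdiction": company["jurisdiction"],
        })
    return combined

rows = run_job(companies, officers)
for row in rows:
    print(row)
```

Each combined row pairs a person with a company, which is the shape a graph database needs in order to draw a connection between the two.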

The moments you win a #pulitzer prize – in @ICIJorg office in Washington D.C. What a day. What a year. Long live collaborative journalism!! pic.twitter.com/iE0TFgvP3o

— Bastian Obermayer (@b_obermayer) April 10, 2017

The next step was to use a commercial product called Linkurious, which works with Neo4j to visualize the relationships between the people and organizations mentioned in the data. It creates a sort of interactive flow chart that lets users click on one party to see who that person is connected to based on the Mossack Fonseca data.

If that data were left in SQL form, finding relationships between people and organizations would require writing long and complicated database queries, Cabra said. “In a graph database, if a company is connected to you, and you are connected to other companies, reporters can follow that thread,” she added.
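Following that thread amounts to walking a graph from one node to its neighbors, and then to their neighbors. The sketch below shows the idea with a breadth-first search over a small in-memory graph. It is an illustration in plain Python rather than Neo4j or its query language, and the people and companies are made up.

```python
from collections import defaultdict, deque

# Hypothetical (person, company) links.
LINKS = [
    ("A. Person", "Example Holdings Ltd"),
    ("A. Person", "Sample Trading SA"),
    ("B. Person", "Sample Trading SA"),
    ("C. Person", "Unrelated Corp"),
]

def build_graph(links):
    """Store each link in both directions so it can be followed either way."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def connected(graph, start):
    """Breadth-first search: every node reachable from start, excluding start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)
    return seen

graph = build_graph(LINKS)
print(sorted(connected(graph, "Example Holdings Ltd")))
```

Starting from one company, the search surfaces the person behind it, that person's other company, and the second person attached to that company, while leaving the unrelated company out. In a row-and-column database, each additional hop would typically mean another join in the query.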


At that point the Mossack Fonseca data trove could be shared by authorized reporters in Germany, the U.S., Spain, and elsewhere. Each reporting team could run their own queries, track down their own leads, and do their own reporting. All of that preparation work mentioned above made sure they were all working from one single source of Mossack Fonseca data.


In April of this year, after more than a year of work and a slew of articles, ICIJ members including Süddeutsche Zeitung and the Miami Herald were awarded the Pulitzer Prize for Explanatory Reporting by Columbia University.
