How Tech Made the Pulitzer Prize-Winning Panama Papers Coverage Possible
Reporters are relying more and more on troves of public—or leaked—data to do their jobs. And they also increasingly depend on technology tools to help organize and sift through all of that information.
Case in point: The Panama Papers, a massive leak of financial records from Panamanian law firm Mossack Fonseca obtained by the German newspaper Süddeutsche Zeitung and shared with the International Consortium of Investigative Journalists (ICIJ). That data led to huge scoops last year by journalists around the world that exposed a network of tax havens used by the rich and powerful in government and private industry. The stories led to the resignation of at least one head of state and embarrassed dozens of others including former U.K. prime minister David Cameron and Russian president Vladimir Putin.
But none of those stories would have appeared without a lot of work preparing the data. This was a mother lode: The Panama Papers comprised some 2.6TB of data and 11.5 million documents about Mossack Fonseca clients, many of whom, it turned out, used the law firm and its affiliates to dodge taxes. NSA whistleblower Edward Snowden, who knows a bit about these things, called it the biggest leak in data journalism history. For context, 2.6TB of data would equal the capacity of about 390 DVDs, a stack of which would be nearly 21 feet high.
The data came into the ICIJ’s possession in dribs and drabs, and in many formats. Much of it was email and PDF files of the sort that are created to be printed out and viewed. That document data is called unstructured because it does not come in the neat rows and columns of traditional databases. For this type of data, the ICIJ team used three free, open-source tools: Tesseract to perform optical character recognition (OCR) on scanned and image-based documents, Apache Tika to extract text and metadata from the files, and Apache Solr to index the results and make them searchable.
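To give a rough sense of what "indexing" buys reporters, here is a minimal sketch of an inverted index, the basic idea behind search engines such as Solr. It is not the ICIJ's code and does not use Solr, Tika, or Tesseract; the document ids, sample text, and function names are all hypothetical, and the text is assumed to have been extracted already.

```python
import re
from collections import defaultdict

# Hypothetical text, assumed to be already extracted from the source files.
documents = {
    "doc1": "Invoice for Acme Holdings, registered in Panama",
    "doc2": "Email regarding Acme Holdings shareholders",
    "doc3": "Passport scan for a nominee director",
}


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


index = build_index(documents)
print(sorted(search(index, "acme holdings")))
```

Once the index is built, a lookup touches only the words in the query rather than rereading every document, which is what makes searching millions of files practical.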
And much of the Mossack Fonseca data came from a traditional structured row-and-column database but arrived in very raw form, not as the full database files that would normally be shared. It’s sort of like someone sending a list of letters and words instead of a fully formatted Word document: the information is there but is not all that useful.
Because of that, the ICIJ tech staff had to take the leaked information and rebuild the original SQL database structure it came from. SQL, or Structured Query Language, is the language used both to define how these databases are set up and to request information from them.
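As an illustration of what rebuilding a database structure involves, the sketch below takes bare rows of values, declares tables and a relationship for them, and then runs a query that the raw rows alone could not answer. It uses Python's built-in SQLite module rather than whatever database the ICIJ used, and every table name, column name, and record is invented for the example.

```python
import sqlite3

# Hypothetical raw rows: values only, with no table or column definitions.
raw_companies = [("1001", "Acme Holdings", "PAN"), ("1002", "Blue Sky Ltd", "BVI")]
raw_officers = [("1", "Jane Doe", "1001"), ("2", "John Roe", "1001"),
                ("3", "Jane Doe", "1002")]

conn = sqlite3.connect(":memory:")

# Step 1: restore the structure by declaring tables, column types, and the link
# between them (these names are assumptions, not the real schema).
conn.executescript("""
CREATE TABLE companies (
    id INTEGER PRIMARY KEY,
    name TEXT,
    jurisdiction TEXT
);
CREATE TABLE officers (
    id INTEGER PRIMARY KEY,
    name TEXT,
    company_id INTEGER REFERENCES companies(id)
);
""")

# Step 2: load the raw values, converting id strings back into integers.
conn.executemany(
    "INSERT INTO companies VALUES (?, ?, ?)",
    [(int(cid), name, juris) for cid, name, juris in raw_companies],
)
conn.executemany(
    "INSERT INTO officers VALUES (?, ?, ?)",
    [(int(oid), name, int(cid)) for oid, name, cid in raw_officers],
)

# Step 3: the rebuilt structure can now answer questions with SQL.
rows = conn.execute(
    """
    SELECT o.name, c.name
    FROM officers AS o
    JOIN companies AS c ON c.id = o.company_id
    WHERE c.jurisdiction = ?
    ORDER BY o.name
    """,
    ("PAN",),
).fetchall()
print(rows)
```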
From there, the ICIJ relied on an open-source version of Talend (TLND) software, which techies call an “ETL” (extract, transform, and load) tool. Talend’s technology let the journalists take the row-and-column data structures they had painstakingly rebuilt and pump them into an open-source Neo4j graph database, which let reporters see the people and organizations in the original data represented as onscreen icons.
Talend enabled the team to take structured data from different sources and automate the process of putting it all together. “It’s like a recipe. You create a job and get three columns of data from this source, and two from this source, intermix them in SQL form,” Mar Cabra, editor of the ICIJ’s Data & Research Unit, told Fortune. Without a tool like Talend, the team would have had to write a ton of software code to do that.
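The "recipe" Cabra describes can be sketched in a few lines. This is a hand-written stand-in for the kind of job an ETL tool automates, not Talend itself: it extracts three columns from one hypothetical source and two from another, combines them, and loads the result as a list of relationships of the sort a graph database stores. All column names, records, and the `OFFICER_OF` label are assumptions made for illustration.

```python
import csv
import io

# Hypothetical source 1: three columns describing companies.
companies_csv = (
    "company_id,company_name,jurisdiction\n"
    "1001,Acme Holdings,PAN\n"
    "1002,Blue Sky Ltd,BVI\n"
)

# Hypothetical source 2: two columns linking officers to companies.
officers_csv = (
    "officer_name,company_id\n"
    "Jane Doe,1001\n"
    "Jane Doe,1002\n"
    "John Roe,1001\n"
)


def extract(text):
    """Extract: read CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(companies, officers):
    """Transform: join the two sources into (person, relationship, company) triples."""
    names = {c["company_id"]: c["company_name"] for c in companies}
    return [
        (o["officer_name"].strip(), "OFFICER_OF", names[o["company_id"]])
        for o in officers
        if o["company_id"] in names  # skip rows that point at unknown companies
    ]


def load(edges, target):
    """Load: append the triples to the target store (a plain list in this sketch)."""
    target.extend(edges)


graph_edges = []
load(transform(extract(companies_csv), extract(officers_csv)), graph_edges)
for edge in graph_edges:
    print(edge)
```

In a real pipeline the target would be the graph database rather than a Python list, and the job would be rerun each time a new batch of data arrived.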
The next step was to use a commercial product called Linkurious, which works with Neo4j to visualize the relationships between the people and organizations mentioned in the data. It creates a sort of interactive flow chart that lets users click on one party to see who that party is connected to, based on the Mossack Fonseca data.
If that data were left in SQL form, finding relationships between people and organizations would require writing long and complicated database queries, Cabra said. “In a graph database, if a company is connected to you, and you are connected to other companies, reporters can follow that thread,” she added.
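To show why following a thread is natural in a graph, here is a minimal sketch that stores connections as a graph in plain Python and walks outward from one person, hop by hop. It does not use Neo4j or Linkurious, and the names and connections are invented. In a row-and-column database, each extra hop would typically mean another join in the query.

```python
from collections import defaultdict, deque

# Hypothetical connections between people and companies.
edges = [
    ("Jane Doe", "Acme Holdings"),
    ("John Roe", "Acme Holdings"),
    ("Jane Doe", "Blue Sky Ltd"),
    ("Sam Poe", "Blue Sky Ltd"),
    ("Ann Lee", "Other Corp"),
]


def build_graph(pairs):
    """Store each connection in both directions so it can be followed either way."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def connected(graph, start, max_hops):
    """Breadth-first search: map each reachable party to its distance in hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not look beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]
    return distance


graph = build_graph(edges)
for name, hops in sorted(connected(graph, "John Roe", 3).items(), key=lambda kv: kv[1]):
    print(hops, name)
```

Starting from John Roe, the walk reaches Acme Holdings in one hop, Jane Doe in two, and Blue Sky Ltd in three, which is the kind of thread a reporter might pull on.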
At that point, the Mossack Fonseca data trove could be shared with authorized reporters in Germany, the U.S., Spain, and elsewhere. Each reporting team could run its own queries, track down its own leads, and do its own reporting. All of that preparation work ensured that everyone was working from a single source of Mossack Fonseca data.
In April of this year, after more than a year of work and a slew of articles, ICIJ members including Süddeutsche Zeitung and the Miami Herald were awarded the Pulitzer Prize for Explanatory Reporting by Columbia University.