How Lockheed Martin cleans dirty healthcare data
If any company knows a thing or two about sifting through mountains of data, defense contracting giant Lockheed Martin is surely near the top of the list.
Besides developing missiles and weapons systems, the multibillion-dollar company also helps customers with their technology infrastructure and with stitching together disparate databases. But that task is not as easy as it might seem because data is often messy and disorganized.
Ravi Hubbly, a senior engineering manager for Lockheed Martin (LMT), knows just how tedious the job can be. Hubbly is part of Lockheed Martin’s health and life sciences group, which works with federal agencies and medical companies to improve how they process information they collect about patients, drug trials, and billing.
Five years ago, the U.S. Department of Health and Human Services unveiled its ambitious Health.Data.gov project to make more government healthcare data available to companies and government agencies. The idea was that by making more data easily available to download, organizations could develop new software and services to help improve the health care industry’s efficiency.
The problem, however, is that a lot of that data can be really dirty, Hubbly explained.
Hubbly’s 20-person team works with healthcare clients to build systems that can sift through large amounts of healthcare data and identify fraud. By comparing data from the government’s health care initiative with internal corporate data, health care companies can potentially spot when they are being scammed.
“You need to see the full health lifecycle,” said Hubbly on the importance of comparing multiple data sets, like Medicare payment records and physician databases.
However, before health care companies can start crunching numbers to uncover crooked doctors who bill for bogus cancer treatments, all that healthcare data has to be cleaned up and made uniform, Hubbly explained. All too frequently, it is full of incorrect information and unfilled fields.
For example, when a drug gets released to the market, patients and third parties can provide data to the FDA about any adverse drug reactions they may experience, Hubbly said. A patient or doctor could easily have logged a dose of 20 tablets of one drug instead of two tablets, making the information inaccurate.
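To make the dosage example concrete, here is a minimal sketch of the kind of check a cleaning step might run before analysis. The record layout, the drug name, and the plausible-dose limit are all assumptions made up for illustration; they do not come from the FDA or from Lockheed Martin's systems.

```python
# Hypothetical adverse-event reports. Field names and values are invented.
reports = [
    {"report_id": 1, "drug": "drug_x", "tablets": 2},
    {"report_id": 2, "drug": "drug_x", "tablets": 20},    # likely a typo for 2
    {"report_id": 3, "drug": "drug_x", "tablets": None},  # unfilled field
]

# Assumed upper bound on a plausible dose per drug (illustrative only).
MAX_TABLETS = {"drug_x": 4}


def flag_report(report, limits):
    """Return a list of data-quality issues found in one report."""
    issues = []
    dose = report.get("tablets")
    if dose is None:
        issues.append("missing dose")
    elif dose > limits.get(report["drug"], float("inf")):
        issues.append("dose above plausible maximum")
    return issues


# Keep only the reports that have at least one issue.
flagged = {}
for report in reports:
    issues = flag_report(report, MAX_TABLETS)
    if issues:
        flagged[report["report_id"]] = issues
```

A check like this only surfaces suspicious rows. Deciding whether "20" should really be "2" still takes a person or a richer rule.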
And it’s not just poorly entered data that can interfere with efforts to analyze data. Before analyzing multiple datasets, technicians must merge the databases in what’s known as a “join.”
A major problem that plagues companies when merging huge amounts of information is that it may be correct in one database but incorrect in another that they want to compare it with. One database could contain the name “Hewlett-Packard” to represent the enterprise technology giant while another might use the abbreviation “HP.” Both are technically correct, but because the two values differ, records for the same company will not line up when the databases are joined.
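The “Hewlett-Packard” versus “HP” mismatch can be sketched in a few lines. The example below uses made-up records and a hand-written alias table, both assumptions for illustration, to show how an exact-match join silently drops a row and how normalizing the names first recovers it.

```python
# Hypothetical tables. Field names and values are invented for illustration.
vendors_a = [
    {"company": "Hewlett-Packard", "contract_id": "A-1"},
    {"company": "Lockheed Martin", "contract_id": "A-2"},
]
vendors_b = [
    {"company": "HP", "region": "West"},
    {"company": "Lockheed Martin", "region": "East"},
]


def inner_join(left, right, key):
    """Pair up rows whose `key` values match exactly."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Assumed alias table that an analyst would maintain.
ALIASES = {"HP": "Hewlett-Packard"}


def normalize(rows, key, aliases):
    """Return copies of rows with known aliases replaced by a canonical name."""
    return [{**row, key: aliases.get(row[key], row[key])} for row in rows]


# Exact-match join: the Hewlett-Packard row has no partner and is dropped.
naive = inner_join(vendors_a, vendors_b, "company")

# Join after normalizing both sides: both companies match.
cleaned = inner_join(
    normalize(vendors_a, "company", ALIASES),
    normalize(vendors_b, "company", ALIASES),
    "company",
)
```

Here `naive` contains only the Lockheed Martin row, while `cleaned` contains both companies.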
In order to clean up healthcare data, and any sort of mixed-up data for that matter, Lockheed Martin uses the services of a startup called Trifacta to help sort through the information. The company is one of many new startups—including Tamr and Paxata—that have been raking in millions from investors in recent years amid a boom in data analysis.
While cleaning data to prep for analysis isn’t a new idea, the technologies now available have made the process faster and more efficient. For one thing, these data cleaning technologies work in conjunction with the open-source big data technology Hadoop, which acts as a giant digital repository that companies can dump their data into without any “limit to how much data can be processed,” Hubbly said.
Startups like Trifacta also include machine-learning algorithms in their technology that help the system learn how best to modify the data just the way a customer wants. In the case of merging two databases containing both “Hewlett-Packard” and “HP,” data analysts can tell the system to automatically recognize those words as the same thing. The algorithms help train the system to learn from the analysts’ actions so the next time those words appear in the database, it will know how to group them together.
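The idea of a system that remembers what analysts tell it can be illustrated with a deliberately simple stand-in. The sketch below is not Trifacta's algorithm and involves no machine learning. It is a hypothetical `NameResolver` class that stores each analyst-confirmed pairing and reapplies it the next time the same variant shows up, ignoring case and surrounding whitespace.

```python
class NameResolver:
    """Toy stand-in for a system that learns groupings from analyst actions."""

    def __init__(self):
        # Maps a normalized variant (lowercased, trimmed) to its canonical name.
        self._canonical = {}

    @staticmethod
    def _key(value):
        return value.strip().lower()

    def confirm(self, variant, canonical):
        """Record an analyst's decision that `variant` means `canonical`."""
        self._canonical[self._key(variant)] = canonical
        self._canonical[self._key(canonical)] = canonical

    def resolve(self, value):
        """Return the canonical name if one was learned, else the value as-is."""
        return self._canonical.get(self._key(value), value)


resolver = NameResolver()
resolver.confirm("HP", "Hewlett-Packard")  # the analyst's one-time action

incoming = ["hp ", "Hewlett-Packard", "Paxata"]
resolved = [resolver.resolve(name) for name in incoming]
```

A production tool would generalize from such confirmations to values it has never seen. This sketch only replays the decisions it was given.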
The system basically automates the time-consuming task of having to manually sift through the different databases.
Data that previously took three to four weeks to prepare for analysis can now be handled almost instantaneously with the new data-cleaning tools on the market, Hubbly said.
It should be noted that using technology to clean data is not limited to the healthcare industry. Any business looking to analyze data should take steps to verify that it is working with clean information. A telecommunications company that’s been acquiring businesses, for example, will often be inundated with a mish-mash of information that it must scrub before combining it with its master data.
But as Hubbly explained, having to merge data is no longer the nightmare situation it used to be, when company coders had to manually cobble together ways to automate the task. With the new tools on the market, coders need less time to help the business side clean data in its effort to improve the bottom line.