Why synthetic data is such a hot topic in the artificial intelligence world
Farm machinery giant John Deere needed a huge amount of data to train its tractors to “see” like farmers. But lacking enough images, the company’s artificial intelligence sometimes failed to distinguish corn from invasive plants, such as when spraying weed killer.
While A.I. can identify a weed called itchgrass in sunshine, it sometimes gets confused when it’s cloudy. To overcome that hurdle, John Deere trains its A.I. on synthetic photos—mathematical representations that only computers can understand—that show the weed under noonday sun, gray skies, and damp with rain.
Using images created by computers helps eliminate the “need to take a picture of every single possible weed” under every condition, says Julian Sanchez, John Deere’s director of emerging technology.
Synthetic data, a relatively new development, may be key to A.I.’s ultimate success, some A.I. researchers say. The excitement is so great that a cottage industry has emerged to sell synthetic data services to business customers. In addition to imagery, they offer synthetic versions of data typically found in corporate spreadsheets.
But it’s still too early to say whether those synthetic data vendors will succeed, several industry experts tell Fortune. It’s also unclear whether the technology itself is good enough to improve machine-learning models.
This data looks real, but it’s fake
Unity Technologies is best known for its software, which developers use to create video games. But recently, the company, which went public in 2020, opened a new business unit for creating synthetic visuals for businesses to feed into computer-vision systems.
Manufacturers, for instance, could create synthetic visuals to help train technology that automatically scans merchandise on the factory floor for flaws, says Danny Lange, Unity’s senior vice president of artificial intelligence. A Unity spokesperson says that Boeing has used Unity’s technology to create synthetic images for training software used by Boeing mechanics to spot production flaws in airplanes.
Lange says that companies hate having to pay third parties $50,000 to add labels to training data so that machine-learning models can understand it when, in many cases, the models end up not working as well as hoped. This failure ends up forcing companies to spend even more on labeling.
Rendered.ai, a startup that just raised $6 million to help it grow its synthetic data business, makes tools for developers to create their own artificial imagery more easily, explains its CEO, Nathan Kundtz. X-ray technology company Quadridox used Rendered.ai’s software to generate fake airport X-ray scans of passenger bags that included images of banned weapons like explosives, guns, and knives.
This set of synthetic X-ray airport scans, which was spawned from an original data set of X-ray baggage scans, was used to improve Quadridox’s security screening technology, Kundtz says. In a real-world data set, companies like Quadridox may only have access to real imagery of five different knives, but “in the synthetic world, I can create an infinite number of variations,” he explains.
One synthetic photo resembles an X-ray scan of luggage, in which a handgun is located in the center of the image as if it were placed by a careless criminal. Thousands more could be created showing the weapon and others like it hidden in different parts of the luggage.
Another startup, Mostly AI, specializes in technology to help companies generate synthetic financial data sets, such as customer sales records. The idea behind the Austria-based company is that clients can create fake versions of their existing customer data to avoid violating Europe’s tough GDPR privacy laws.
Over the years, one unnamed European telecommunication client amassed an enormous amount of customer data, says Alexandra Ebert, chief trust officer at Mostly AI. But because many of the people on the list never consented to have their data collected, the client couldn’t use much of the information.
By creating synthetic data derived from the original data set for training its A.I., the company was able to do tasks like predict customer churn—business jargon for the number of people who would cancel their service in the future—without breaking GDPR rules, Ebert says.
“They could better cater to the needs of clients without risking privacy or being in conflict with privacy legislation,” she says.
Is it still too early for synthetic data?
Despite the enthusiasm over synthetic data, some experts are concerned about how useful the technology currently is. Sumit Agarwal, a senior analyst at Gartner who specializes in A.I., says that “the idea of synthetic data is very, very promising,” and that banks or health care companies could use artificial data to protect customer and patient privacy when they require data to develop machine-learning algorithms. But he says that startups offering synthetic data generation services need several more years to improve their technology and show that it works outside a few test cases.
“I think there’s a lot of work that has to happen for them to be in the state where companies can use them without any support,” Agarwal said of heavily regulated businesses using synthetic data services.
Duke Energy chief information officer Bonnie Titone said during a recent Fortune Brainstorm A.I. online event that the company’s experiments using synthetic data resulted in A.I. systems that weren’t very good.
“We find it better to take our own data and kind of obfuscate it or use it in a non-identifiable manner, because there’s so much power in the information that was real life, whether it’s your customers or internal operations,” Titone said.
Some venture capitalists are hesitant to invest in synthetic data startups, particularly those selling services to create artificial financial data sets. Ariel Tseitlin, an investor with Scale Venture Partners, says he hasn’t yet found a compelling startup that creates synthetic data, and that there are other data-anonymizing techniques that companies can already use to ensure that their real data is compliant with privacy laws.
“I don’t think that’s a great use case for synthetic data generation,” Tseitlin says.
Additionally, businesses with a bit of technical acumen can create their own fake data instead of using third parties. John Deere’s tech team, for instance, created its own synthetic images of crops and weeds to use.
Despite all these caveats, many businesses are optimistic about synthetic data’s potential.
Self-driving car companies like Waymo continue to use synthetic data that is gathered by “driving” their autonomous vehicles through virtual cities and roads, as a way to improve their A.I. systems. And American Express is creating synthetic financial data as part of its cybersecurity research. These companies are all hoping that the projects will eventually pay off.
John Deere’s Sanchez says his company will eventually test how well A.I. systems trained on synthetic data perform compared with those trained on real-life data. Ultimately, he says, it will show “whether this form of sausage making works out and gives us some efficiency.”