Israeli startup DataGen raises $18.5 million to train A.I. with fake data

Subscribe to Eye on A.I. for expert weekly analysis on the intersection of artificial intelligence and industry, delivered free to your inbox.

Companies interested in using artificial intelligence face a big obstacle: Having enough of the right kind of data to train their systems.

Companies need large amounts of labelled, historical examples to train A.I. systems, particularly those that work with images and videos. The demand has spawned a whole sub-industry of companies that specialize in helping other businesses annotate their data. Among them are Scale AI, which was valued at $3.5 billion in a December 2020 funding round, Hive, Sama, Labelbox, Cloudfactory, and a division of A.I. company Clarifai, among others.

But there is another way to produce enough data to train A.I. systems: Fabricate it.

Fake it till you make it

That’s essentially what a fast-growing Israeli startup called DataGen specializes in doing. The company uses its own machine learning systems to create what’s known as “synthetic data”—in this case, artificially generated still and video images—that DataGen’s customers then use to train their own A.I.

DataGen can produce a bespoke synthetic dataset for its customers in just a few hours, says Ofir Chakon, DataGen’s founder and chief executive officer. Compare that to the months it typically takes a data labelling company to curate an equivalent real world video or image library.

Synthetic data has other advantages too, in addition to speed. With synthetic data, companies don’t need to worry about any personal identifying information in the dataset, nor need they worry about ethical considerations around how the data was collected. This feature gains significance as more and more of the world’s population is covered by data protection laws. Gartner, the technology analytics firm, says that by 2023, 65% of the world’s population will have their personal data covered by some sort of privacy regulation, up from just 10% last year.

Data bias can still be a problem though. A synthetic dataset can, in some cases, simply replicate the same biases found in a real dataset. But DataGen has ways to potentially eliminate bias. The company can shape the dataset it generates however it wishes, allowing the company to create a lot more examples of unusual or rare cases to ensure that an A.I. system will know how to handle these. For instance, what will happen to a robot that uses a video camera to “see” as it navigates around a warehouse if there is a power cut and the warehouse’s low-level emergency lighting switches on? Acquiring enough examples of these rarer cases is far more difficult with real world datasets.

“Our customers have full control over all the parameters that go into the data they create,” Chakon says. “The real-world implication is that, once deployed, you can be sure it’s going to work well in different domains, with different ethnicities, in different geographic locations or any environment you can imagine.”

An enabler for the whole A.I. industry

DataGen has attracted some big name investors.

On Tuesday, the company announced a $18.5 million early stage funding round lead by Israeli venture capital funds TLV Partners and Viola Ventures. The round also includes an impressive list of machine learning luminaries. They include Michael Black, a computer vision pioneer who is now director of the Max Planck Institute for Intelligent Systems; Gal Chechik, director of A.I. research at computer chip giant Nvidia; Anthony Goldbloom, the chief executive officer and cofounder of machine learning competition site Kaggle; and Trevor Darrell, a computer science professor at the University of California at Berkeley. Existing investor Spider Capital is participating in the new funding round too.

Rona Segev, a founding partner at TLV, says that simulated data “addresses problems which are just unsolvable without it.” She says that synthetic data is “an enabler for the whole A.I. industry. Without simulated data, the industry will slow.”

DataGen said it would use the funds to hire more machine learning experts and engineers, expanding from the 30 employees it has currently, most of them based in Israel. Chakon said the company would also expand its focus from creating training sets for machine learning to data that is also designed to test those A.I. systems once they have been trained.

The future product plans aim to address a major problem with a lot of A.I. systems: quality assurance. Oftentimes, only a small subset of available data is reserved for testing an A.I. As a result, it may be hard for a company to test enough rare situations to know how well an A.I. will perform if it encounters the same or similar situations in the real world.

DataGen’s cofounders Ofir Chakon, CEO, (left) and Gil Elbaz, tech chief (right) create so-called synthetic data to train A.I. systems.

The startup, which was founded in 2018, has about 10 paying customers so far, “all of them big companies,” Chakon says, although he says contractual agreements prevent him from naming them. DataGen’s data has been used to train warehouse robots to pick items off a conveyor belt, help with factory operations for a home appliance manufacturer, and for a number of physical security applications, such as identifying shop lifters in a retail store.

Just like the real thing

“We are experts in everything that has to do with indoor environments and human perceptions,” Chakon says, adding that the company can also simulate the way people move in indoor environments. “We generate data that looks exactly like the target domain.”

In other words, a set of DataGen-created images of various household items in a crate—a scene used to train a robot picking arm in a logistics warehouse—looks just like the real overhead video images taken of those objects in a real crate on a real warehouse conveyor. A fabricated scene of a kitchen looks as if the company had gone out and commissioned photography of a real kitchen. And a simulation of a person’s face displays all the same movement points, textures, and skin tones as would be found in a real photograph or video.

DataGen uses software that represents objects and people as a kind of three-dimensional mesh, allowing the user to easily edit and adjust their size and shape. The company pairs that visual meshwork with a physics simulator to create realistic scenes of how objects moves. By doing so, the company can easily depict what happens when one object moves on top of or in front of another, potentially obscuring a clear view of that object from a certain angle.

DataGen uses a machine learning technique called a GAN, short for “generative adversarial network,” to create its realistic simulations. GANs also underpin the creation of so-called Deepfakes, which are a kind of synthetic data, but Deepfakes exist only in a two-dimensional representation of a person’s face, not a three-dimensional one.

Chakon says he thinks DataGen’s use of 3-D simulation gives it an advantage over other companies that are trying to use 2-D photos and videos to create synthetic data. He said it is much more difficult to simulate the interaction of objects—particularly when one object obscures or occludes another or two objects collide—accurately with just two-dimensional data.