Facebook says its new Instagram-trained A.I. represents a big leap forward for computer vision

March 4, 2021, 3:22 PM UTC


Facebook has created an artificial intelligence system that could make it far more efficient for companies to train computer vision software for a range of tasks, from facial recognition to functions needed for self-driving cars.

The company unveiled the new system in a series of blog posts Thursday.

Today, training machine-learning systems for such tasks often requires data sets of hundreds of thousands or even millions of labeled examples. Creating an accurately labeled data set of that size can be both expensive and time-consuming.

Learning in baby steps

Facebook’s breakthrough allows an A.I. model to be trained from a very large set of unlabeled image data and then fine-tuned for a wide range of specific vision-related tasks using just a tiny fraction of the amount of labeled data that such software typically requires.

Yann LeCun, Facebook’s chief A.I. scientist, said the idea is to create artificial intelligence that can learn the way a human infant does: through observation, and by building a mental model of the relationships between objects.

“Babies learn how the world works by watching the spectacle of the world,” LeCun told Fortune. “Once you have a good understanding and representation of the world, you can learn any task relatively quickly.”

This is why most teenagers can learn to drive with only a few hours of lessons, LeCun said. Today’s software for self-driving cars, by contrast, requires millions of hours of simulated driving to reach the same level of performance.

The ability to learn from far fewer labeled examples is critical for a wide range of commercial A.I. applications, LeCun said. In medical imaging diagnostics, for example, much of today’s computer vision software requires tens of thousands of annotated examples to reach the same accuracy as a human radiologist. But for a rare lung condition, there might not be tens of thousands of examples available to train such a system.

What you SEER is what you get

In recent years, the use of similar techniques in natural language processing has resulted in giant leaps forward in the capabilities of A.I. software. The latest technology can perform tasks such as language translation, document summarization, answering questions about a text, and writing long passages of coherent text from a simple human-written prompt. The same techniques have also allowed big performance improvements in speech recognition for digital assistants such as Amazon’s Alexa and Google Assistant.

Now Facebook hopes that its new A.I. will result in a similar leap forward in the capabilities of computer vision systems, and possibly also systems that can learn the relationship between images and the words that describe those images.

The new A.I., which Facebook calls SEER, is a breakthrough in a type of machine learning called self-supervision. This type of A.I. model uncovers the relationships in data on its own, using statistical methods, without the need for labeled data to act as a kind of instructor that tells the system how to link a given input to a given output. (SEER is a shortening of the phrase “self-supervised,” according to a Facebook blog post announcing the A.I. system.)
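To make the idea concrete, here is a minimal sketch of self-supervised training in PyTorch. It uses rotation prediction, a classic pretext task, rather than SEER’s actual clustering objective (described further below); the point is only that the training signal is manufactured from the unlabeled images themselves. The tiny backbone and random data are placeholders, not anything Facebook has published.

```python
# A minimal self-supervision sketch (NOT SEER's method): the model is
# trained to predict how each unlabeled image was rotated, a "pretext"
# task whose labels come for free from the data itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(                      # tiny stand-in backbone
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(16, 4)                       # classify 4 possible rotations
opt = torch.optim.SGD(
    list(encoder.parameters()) + list(head.parameters()), lr=0.1)

images = torch.randn(8, 3, 32, 32)            # a batch of unlabeled images
rot = torch.randint(0, 4, (8,))               # random rotation = free label
rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                       for img, r in zip(images, rot)])

opt.zero_grad()
loss = F.cross_entropy(head(encoder(rotated)), rot)
loss.backward()
opt.step()  # the encoder learns visual features without any human labels
```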

In this case, SEER is an ultra-large vision model, with more than 1 billion internal parameters, trained on more than 1 billion images from publicly available Instagram accounts. That follows the trend in self-supervised natural language processing, where some of the best systems have hundreds of billions of parameters and are trained on data sets that include almost everything publicly available on the Internet.

On ImageNet, the field’s signature image identification benchmark test, SEER achieved an 84.2% accuracy, even though it had not been trained on that data. The results outperformed the best previous self-supervised systems that had been trained for that task.

SEER also outperformed the best systems that have been trained from labeled data on tasks such as object detection, segmenting an image into component parts, and image classification. When given just 10% of the labeled ImageNet examples to train on, SEER still achieved 77.9% accuracy on the full ImageNet data set. With only 1% of the annotated ImageNet examples, the A.I. achieved a 60.5% accuracy.
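Those numbers reflect a protocol that is simple to sketch: keep the self-supervised backbone, then train a classifier using only a small fraction of the labeled examples. The snippet below is an illustration of that idea, not Facebook’s evaluation code; the encoder and data are placeholders, and it freezes the backbone (the cheap “linear probe” variant), whereas full fine-tuning would also update the encoder’s weights.

```python
# Illustrative low-label training: reuse a pretrained backbone and
# train only a classifier head on 1% of the labeled examples.
# `pretrained_encoder` is a placeholder; in practice it would be the
# backbone produced by self-supervised pretraining.
import torch
import torch.nn as nn
import torch.nn.functional as F

pretrained_encoder = nn.Sequential(           # placeholder backbone
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
classifier = nn.Linear(64, 1000)              # 1,000 ImageNet classes

images = torch.randn(1000, 3, 64, 64)         # pretend labeled pool
labels = torch.randint(0, 1000, (1000,))
subset = torch.randperm(len(images))[:10]     # keep just 1% of the labels

opt = torch.optim.SGD(classifier.parameters(), lr=0.01)
opt.zero_grad()
feats = pretrained_encoder(images[subset]).detach()  # backbone stays frozen
loss = F.cross_entropy(classifier(feats), labels[subset])
loss.backward()
opt.step()
```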

Doing it for the ’Gram

Although Facebook is not yet using SEER or any other fully self-supervised computer vision A.I. on its own social networks, LeCun says the company does use a system that is weakly supervised, having been trained on images paired with Instagram hashtags. It is this A.I. that allows Facebook to group users’ photos thematically and to automatically detect hateful images or terrorist propaganda. LeCun said he thought SEER, or software based on the same underlying algorithms, would in the near future likely become the company’s base computer vision system, to be fine-tuned for specific use cases.

LeCun acknowledges that the size of these very large, self-supervised A.I. systems, and the expense of the computer hardware needed to train and run them, can intimidate business executives and academic researchers alike. But neural networks, a kind of machine-learning software loosely based on the human brain, underpin most recent advances in A.I., including SEER, and LeCun pointed out that new computer chips designed specifically to run them are advancing significantly faster than these large systems are growing. In other words, the cost of training should decline in the future.

LeCun also noted that even the most massive artificial neural networks in use today have about as many connections as a mouse brain. Creating machines that could equal human intelligence would almost certainly require much larger software systems.

LeCun, who is a past recipient of the Turing Award, computer science’s highest accolade, was dismissive of concerns about the carbon footprint of these large, self-supervised A.I. models. He said all the data centers in the world consumed only about 1% to 2% of the planet’s electricity, with the training and use of A.I. algorithms making up an even smaller fraction of that amount. He also said the new computer chips designed for A.I. were more energy efficient than older hardware at running large A.I. systems, so even if this software kept growing in size, its energy footprint per decision should decline over time.

Ethical dilemmas

LeCun said he took more seriously another ethical concern raised about these ultra-large, self-supervised systems: that because they are trained from massive amounts of Internet data, they can pick up the biases, including racial and gender stereotypes, inherent in that data. Often these biases are not obvious until the system is deployed. Because the training data sets are so large, auditing them for bias can be difficult.

LeCun said that removing such biases from self-supervised systems might require specialized training with an additional, smaller data set curated essentially to unteach the system a specific bias. More research would need to be done to figure out the best way to do this.

“It’s a complicated issue,” LeCun said. But in his view, self-supervised systems are less likely to be biased than some A.I. software that learns from labeled examples: the labels are often applied by biased humans, and because those labeled data sets are smaller, each biased example has a bigger impact.

The Facebook scientist’s views on the potential bias of A.I. systems have gotten him into trouble in the past. Last year, LeCun temporarily abandoned Twitter after a spat on the platform with Timnit Gebru, an A.I. ethics researcher who was recently, and controversially, pushed out of Google after raising concerns about large, self-supervised language models. Some other computer scientists accused LeCun of being tone-deaf and unfairly high-handed in his back-and-forth with Gebru, who is one of the few prominent Black women in A.I. research. The dispute centered on the nature of the potential harms A.I. systems could cause and how much responsibility machine-learning researchers needed to take for addressing them.

LeCun says the next step for the self-supervised techniques behind SEER is to extend them from still images to video. That’s no easy feat, he said: building A.I. systems that understand the world well enough to accurately predict what will happen next in a video is a problem that has stymied computer scientists, himself included, for years. Another ripe research area is “multimodal learning,” in which an A.I. system is trained on both images and text simultaneously.

At the heart of the SEER system is an algorithm that Facebook calls SwAV, short for “swapping assignments between multiple views,” which works by clustering images. First the algorithm applies distortions to an image, in this case a series of crops, to create multiple “views” of the same picture. Then it tries to determine which cluster the original image should be assigned to, based on those alternative views.

This new method allows the system to be trained much more efficiently. Training of this sort required a sixth of the data needed in previous methods that were based on comparing just two image views at a time, according to Facebook’s blog post.
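A heavily simplified sketch of that “swapped assignment” objective appears below. It keeps only the core idea, that each view must predict the cluster assignment computed from the other view, and omits pieces of the real algorithm: the trainable backbone that produces the features, the multi-crop augmentation, and the Sinkhorn normalization SwAV uses to keep clusters balanced.

```python
# Heavily simplified sketch of SwAV's swapped-assignment objective.
# Real SwAV adds a trainable backbone, multi-crop augmentation, and a
# Sinkhorn step that balances the clusters; all are omitted here.
import torch
import torch.nn.functional as F

B, D, K = 32, 128, 10                    # batch, feature dim, clusters
prototypes = F.normalize(torch.randn(K, D), dim=1)  # learned in practice

def soft_assign(z, temperature=0.1):
    """Soft cluster assignment of L2-normalized features to prototypes."""
    return F.softmax(z @ prototypes.T / temperature, dim=1)

# Features for two augmented "views" (e.g. crops) of the same images;
# random here, but produced by the backbone network in the real system.
z1 = F.normalize(torch.randn(B, D), dim=1)
z2 = F.normalize(torch.randn(B, D), dim=1)

q1, q2 = soft_assign(z1), soft_assign(z2)
# The swap: view 1 must predict view 2's assignment, and vice versa.
loss = (-(q2.detach() * torch.log(q1 + 1e-8)).sum(dim=1).mean()
        - (q1.detach() * torch.log(q2 + 1e-8)).sum(dim=1).mean())
```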

Facebook is making the SwAV algorithm open source and free for anyone to use. The company is also releasing to the public VISSL, a library of components for building self-supervised computer vision systems, along with tools for benchmarking them.
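For readers who want to experiment with the released code, the SwAV repository on GitHub documents a torch.hub entry point for a SwAV-pretrained ResNet-50. The call below follows that repository’s README; the entry-point name could change over time, so treat it as a starting point rather than a guaranteed interface.

```python
# Loading a SwAV-pretrained ResNet-50 via torch.hub, per the
# facebookresearch/swav README (entry-point name may change over time).
import torch

model = torch.hub.load("facebookresearch/swav:main", "resnet50")
model.eval()

with torch.no_grad():
    features = model(torch.randn(1, 3, 224, 224))  # forward a dummy image
```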

This story has been updated to make clear that Facebook is not making its trained SEER A.I. model or the Instagram data set used to train it available to the public.
