Facebook is using machine learning to build better computer vision

Facebook today launched its Moments product, which uses Facebook’s image recognition abilities to scan your photos for your friends and then lets people create private photo albums with a particular group, such as the people in the photo. The idea is to make it easier to share photos from a big event among attendees without the cumbersome process of emailing snapshots to everyone or the awkward end-of-event huddle while six people take the exact same group shot. It’s not a cure for cancer, but behind the scenes of this new feature is an impressive technology that Facebook has been working on for years.

A key element of the Moments feature is the ability for Facebook’s algorithms to recognize people’s faces across different photos, so that Moments knows who was at the event. This requires computer vision expertise that companies such as Google, Microsoft, Baidu, and others are currently researching for everything from self-driving cars to silly web products such as Microsoft’s How Old Do I Look?

In launching Moments, Facebook is sharing data about its own successes in computer vision research. The company says it can recognize faces with 98% accuracy, and that it can do so quickly: it can identify you in one picture out of 800 million in less than 5 seconds. Finally, it can do all of this even if it doesn’t have a full frontal shot of your face (or even if your face isn’t in the photo at all), thanks to a machine learning algorithm that can look at other elements in the picture and at data associated with the photo.

[Image: Facebook Moments app. Credit: Facebook]

Inside Moments
Fortune spoke with Yann LeCun, Facebook’s director of artificial intelligence research, to understand how his team helped a computer understand who you are, and where Facebook is heading next with its AI research. Perhaps the first thing to understand is that when LeCun discusses computer vision, it’s not the same as how a person sees, although the process of teaching software how to recognize an object has some similarities.

For example, Facebook’s facial recognition, which is the basis of the current efforts, can’t identify you. It can only recognize whether a person in one photo is the same as a person in another photo. Identification is a completely separate step.
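To make the distinction concrete, here is a minimal sketch of that "same person or not" step. It assumes each face has already been boiled down to a short list of numbers. The three feature lists, the `same_person` helper, and the 0.9 threshold are all invented for illustration and are not Facebook’s actual method.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_person(features_a, features_b, threshold=0.9):
    """Hypothetical check: do two photos show the same face?

    The threshold is an assumption; a real system would tune it on data.
    """
    return cosine_similarity(features_a, features_b) >= threshold

# Made-up feature vectors standing in for three face photos.
photo_1 = [0.9, 0.1, 0.4]
photo_2 = [0.85, 0.15, 0.38]  # close to photo_1
photo_3 = [0.1, 0.9, 0.2]     # points in a very different direction

print(same_person(photo_1, photo_2))  # similar vectors
print(same_person(photo_1, photo_3))  # dissimilar vectors
```

Note that nothing here says who the person is. Attaching a name to a matched face would be the separate identification step.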

Because Facebook is about connecting people, its computer vision efforts have focused on recognizing faces as opposed to cats, cars, or other non-human subjects. To do this, it uses a database of photos of celebrities and politicians called Labeled Faces in the Wild. This collection has more than 13,000 photos of people with different hairdos and different outfits, sometimes wearing glasses, and more. Facebook used this collection to train its machine learning algorithms. Other companies have used this data set as well, and some universities have even trained systems with accuracy rates higher than 98% using Labeled Faces.

So how did Facebook get from giving a machine a picture of Angelina Jolie to somehow using that photo to help identify your sister across different photo albums on Facebook? LeCun is the man to ask. About 20 years ago when he was working at Bell Labs (now AT&T’s Image Processing Research Department), he happened upon a way of thinking about teaching computers to see that wasn’t really used outside of academia until about three years ago.

How computers learn to see
That technique is called a convolutional neural network. It takes its name from a mathematical operation called a convolution and draws its inspiration from how the human brain learns. The brain learns by establishing connections between neurons, and the more often a signal is sent over those neurons, the denser those connections get. In a similar vein, when a computer establishes similarities between two images, it assigns a weight to those similarities. In convolutional neural networks, the goal is to train the machine to adjust the weights on those connections so it can tell with increasing accuracy whether the image matches.
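For readers curious about the operation itself, the sketch below slides a small grid of weights (a kernel) across a tiny image and sums the products at each position. The image, the kernel values, and the `convolve2d` function name are assumptions chosen for illustration. In a real network the kernel weights are learned rather than written by hand.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` and sum the element-wise products.

    Like most neural network code, this does not flip the kernel, so
    strictly speaking it is a cross-correlation. No padding is applied,
    so the output is smaller than the input.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# A 3x4 "image": dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# A hand-written kernel that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 1],
    [-1, 1],
]

print(convolve2d(image, kernel))  # [[0, 2, 0], [0, 2, 0]]
```

The output is largest exactly where the image changes from dark to bright, which is the sense in which a kernel detects a feature.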

The process of doing this is incredibly complicated and involves different calculations that work to establish how important certain aspects of the image are to recognizing what the image is. For example, if you want to train a computer to recognize faces, the pixels related to the background are less important. The tricky (and frankly amazing) part is that the machine learns on its own which parts of the image are most relevant. It still takes a lot of human effort to nudge the computer toward the right way to weight the similarities, but once the model is built, it can generalize those relationships going forward.

The process can take a few days on a powerful computer.
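A toy version of that learning process fits in a few lines. The sketch below assumes a made-up data set with just two inputs, one standing in for a face pixel and one for a background pixel, and uses plain gradient descent on a linear model. The data, the learning rate, and the `train` function are illustrative assumptions, far simpler than anything Facebook would run.

```python
def train(samples, learning_rate=0.1, epochs=200):
    """Fit weights for (face, background) inputs by gradient descent.

    Each sample is ((face, background), label). The model predicts
    w_face * face + w_background * background, and the weights are
    nudged after every sample to shrink the squared error.
    """
    w_face, w_background = 0.0, 0.0
    for _ in range(epochs):
        for (face, background), label in samples:
            prediction = w_face * face + w_background * background
            error = prediction - label
            # Gradient of error**2 with respect to each weight.
            w_face -= learning_rate * 2 * error * face
            w_background -= learning_rate * 2 * error * background
    return w_face, w_background

# Made-up data: the label follows the face input and ignores the background.
samples = [
    ((1, 0), 1),
    ((1, 1), 1),
    ((0, 0), 0),
    ((0, 1), 0),
]

w_face, w_background = train(samples)
print(round(w_face, 3), round(w_background, 3))
```

Nobody tells the model that the background is irrelevant. Its weight shrinks toward zero because it does not help predict the label, while the face weight grows toward one.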

[youtube https://www.youtube.com/watch?v=9T45bs9di_U]

Convolutional neural networks have become the basis for almost all of the computer vision research done today, after a team of researchers led by Geoffrey Hinton at the University of Toronto used that technique to win a competition where image recognition algorithms vie to be most accurate. Hinton, whose team and startup were later acquired by Google, won the competition with a test error rate of 15.3%, compared to 26.2% for the second-place finisher.

Don’t post that photo!
As research continues, the opportunities for use in our day-to-day life are significant. Yes, there is the ability to match people’s faces in a crowd that might lead to greater government surveillance, but there is also an opportunity to use better facial recognition to manage your privacy. For example, with automatic facial recognition at scale, any picture of you uploaded to Facebook (or perhaps even the web) could result in a notification.

If you are somehow captured in the background of a tourist shot of Times Square, say, you could get a notification and the option to blur your face. Applied to children, the blurring or removal could be automated. LeCun notes that Facebook is interested in such tools, but also stresses that Facebook’s interest in machine learning goes far beyond image recognition.

Facebook’s goal is to get a computer to understand empathy. Obviously, it won’t be able to feel what humans do, but it can be trained to recognize what emotions are and how people will react. With that level of understanding, Facebook could, say, offer a warning when you are about to post a photo of you drunk and ask if you really want to do that.

“This would not be face recognition,” said LeCun. “We don’t care who is in the picture. We would use other types of image recognition and train them differently to say that this looks embarrassing and then tap you on the shoulder to make sure you want to post this publicly.”

This isn’t something Facebook can do today, but LeCun offered these concepts as a thought experiment to show where Facebook could head with its AI research. Of course, this sort of expertise informed by an algorithm can make people deeply uncomfortable. Today Facebook doesn’t turn on its auto-tagging features in places such as Canada and the European Union because of privacy concerns, and there’s a certain creep factor in having a computer second-guess your photo-sharing choices or having software parse your jokes to understand what you find funny.

“What we’d like to do is make machines more intelligent, understanding text, images, videos and posts,” LeCun said. “Anything that can happen in the digital world we want to understand the context.” Because there is so much digital content, people could easily become overwhelmed by the information flooding their feeds. The efforts of LeCun’s team will help connect people with the content that is most relevant to their interests and priorities. It’s a complex solution to a simple goal: to make sure that you see what you want to see on Facebook.

“That’s the big mission that we at Facebook are trying to fulfill,” LeCun said. “Machines that understand people.”
