A.I. is Everywhere, But Human Judgment Matters More Than Ever

This is the web version of Eye on A.I., Fortune’s weekly newsletter on developments in artificial intelligence and machine learning. To get it delivered weekly to your inbox, sign up here.

I spent last week at Web Summit in Lisbon, where one of the key takeaways was the extent to which, at many companies, machine learning has moved well beyond proofs of concept and is now deployed across the business in ways that are having massive, real-world impacts.

Werner Vogels, Amazon’s chief technology officer, gave a keynote detailing how machine learning underpins absolutely everything that the Everything Store does, from recommending products to figuring out which to stock in each fulfillment center to safely operating its new delivery drones.

One slide from Vogels’s talk really struck me: While Amazon has been using various machine learning techniques since its founding to help forecast demand for individual products, Vogels said annual improvements had been merely incremental until 2015. That is the year Amazon switched to using deep learning methods—the kind of A.I. that is based on multi-level neural networks, which mimic, in a loose way, the connections in the human brain.

And what happened? Amazon saw a 15-fold increase in the accuracy of its forecasts, a leap that has enabled Amazon to roll out its one-day Prime delivery guarantee to more and more geographies.

The A.I. profiles of lots of other companies are starting to look more like Amazon’s. Case in point: Mastercard. Ajay Bhalla, who heads cyber and intelligence solutions for the payments company, told me it has used A.I. to cut in half the number of times a customer has their credit card transaction erroneously declined, while at the same time reducing fraudulent transactions by about 40%.

Mastercard has also used predictive analytics to spot cyberattacks and waves of fraudulent activity by organized crime groups. Bhalla says this has helped its customers avoid some $7.5 billion worth of damage from cyber attacks in just the past 10 months. And, he says, Mastercard is now using A.I.-based software across every section of the company, from human resources to finance to marketing.

***

While Web Summit was all about the promise of A.I., this news from last week ought to give people pause: the National Transportation Safety Board released its preliminary report investigating how one of Uber’s self-driving cars came to strike and kill 49-year-old Elaine Herzberg as she crossed the road in Tempe, Arizona, last year.

The NTSB found in its “Vehicle Automation Report” that while the car’s sensors did detect Herzberg six seconds before hitting her, the self-driving system failed to correctly classify her as a pedestrian, in part because Uber had trained its computer vision system to expect pedestrians only in designated crosswalks. What’s more, the agency concluded that Uber’s engineers had programmed the car to brake or take evasive maneuvers only if its computer systems were highly confident that a collision was likely.

Humans decided to train the system in this way and set these tolerances. Most likely, this was done to prioritize the comfort of Uber’s passengers, who would have found sudden braking and unexpected swerves annoying and alarming.

And it’s ultimately these human decisions that doomed Herzberg.

The report should be required reading for anyone deploying A.I. in settings where it can have real-world consequences.

Jeremy Kahn
@jeremyakahn
jeremy.kahn@fortune.com

A.I. IN THE NEWS

The Government and Tech Sector Must Coordinate on A.I. That was the conclusion of the National Security Commission on A.I.'s first report to Congress. The commission, which is chaired by former Google CEO Eric Schmidt and includes representatives from Amazon, Google, Microsoft and Oracle, said the U.S. currently leads the world in A.I., but that more must be done to make sure U.S. military and intelligence agencies are benefitting from technology developed in Silicon Valley. The report notes that China is quickly gaining on U.S. capabilities and that Chinese government-sponsored R&D spending was on track to exceed America's within a decade.

Twitter Announces Policy on Deepfakes. The social media company released its new internal rules for how it will handle deepfakes, realistic-looking fake videos created using machine learning, and other "manipulated media." The company, according to a report in TechCrunch, says that it will place warning labels next to tweets that contain manipulated content, and link, when possible, to news articles and other sources that would give readers more truthful information. Left unsaid is how exactly Twitter plans to detect deepfakes and other fraudulent content, a problem that has stumped some of the best minds in computer science so far. (Facebook and Microsoft are supporting a "Deepfake Detection Challenge" encouraging researchers to identify A.I. models that can suss out the fakes created by other models.)

U.S. Regulators Are Called on to Investigate HireVue. The Electronic Privacy Information Center has asked the U.S. Federal Trade Commission to investigate HireVue, whose A.I.-powered software has been used by more than 100 firms to screen video interviews with job candidates. The group says the workings of HireVue's A.I. are too opaque and violate FTC rules against "unfair and deceptive" hiring practices. The company has not commented on the allegations.

U.S. Department of Defense and Philips Team Up to Predict Infection. Philips worked with the Department of Defense on a project to develop an A.I. model able to predict which hospital patients were likely to develop infections. In tests on existing data, which included vital signs as well as hospital lab results for patients already admitted to hospitals, the software was able to make successful forecasts up to 48 hours before doctors diagnosed infection. Now Philips and DoD plan to look at whether a similar A.I. system can be used to forecast infections among a healthy population, such as soldiers, equipped with wearable devices to monitor vital signs, like temperature and heart rate.

OpenAI Releases Full-Scale Version of Its "Too Dangerous to Release" Language Model. The San Francisco-based A.I. research shop has released the full-size version of its language modeling algorithm, GPT-2, which can compose whole paragraphs of fairly coherent text from just a few seed words or sentences. When it unveiled the model in February, the company said it was declining to make the most powerful version of the software, which has 1.5 billion parameters, available to the public out of fear it could be abused to create fake news. At the time, many in the A.I. research community criticized that decision as a publicity stunt. OpenAI says it has reversed course now because, since February, it has released gradually more powerful versions of GPT-2 and seen little evidence of misuse.

Forget Learning to Code. In the Future, Code Will Write Itself

Speaking of GPT-2: At Microsoft's Ignite developer conference last week, the company showcased how OpenAI's language model could be used to create an auto-complete feature for lines of software code. Microsoft's team took the language model and trained it on the 3,000 top-rated open-source code repositories on GitHub. The result is a system that suggests, as a coder types, the most likely completion of a line of code. Microsoft says the system can be fine-tuned for a specific team of coders by training it on their particular code base. This is just one of several examples of A.I. simplifying, or sometimes even automating (see Google's AutoML, for example), the act of writing software. So if you thought learning to code was a guarantee of employment in the face of relentless A.I.-driven automation, think again.

(Of note: Microsoft bought GitHub for $7.5 billion in 2018. The company also recently invested $1 billion into OpenAI in a deal that gives the software giant the right to commercialize some of OpenAI's research.)
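
For readers who want a feel for what "the most likely completion of a line of code" means, here is a minimal sketch. It is not Microsoft's system and has nothing to do with GPT-2's internals: it swaps in a far cruder stand-in, a count of which token most often follows which, and the four-line training corpus and the function names `train` and `complete` are invented for illustration.

```python
from collections import Counter, defaultdict


def train(lines):
    """Count which token follows each token across the training lines."""
    follows = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for current, nxt in zip(tokens, tokens[1:]):
            follows[current][nxt] += 1
    return follows


def complete(follows, prefix, max_tokens=5):
    """Greedily extend the prefix with the most frequent next token."""
    tokens = prefix.split()
    if not tokens:
        return prefix
    for _ in range(max_tokens):
        candidates = follows.get(tokens[-1])
        if not candidates:
            break  # nothing in the corpus ever followed this token
        tokens.append(candidates.most_common(1)[0][0])
    return " ".join(tokens)


# Hypothetical, whitespace-tokenized "code base" to learn from.
corpus = [
    "for i in range ( n ) :",
    "for j in range ( n ) :",
    "for i in range ( 10 ) :",
    "for item in items :",
]
model = train(corpus)
suggestion = complete(model, "for i in")
print(suggestion)  # for i in range ( n ) :
```

Retraining the same toy on a different team's lines would change its suggestions, which is, in miniature, the fine-tuning idea Microsoft describes. A real system conditions on far more context than the single preceding token.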

EYE ON A.I. TALENT

London-based A.I. research firm Faculty is hiring well-known computer science researcher Stuart Russell, currently at the University of California, Berkeley, as a special advisor to help lead Faculty's A.I. safety research.

Prowler.io, an A.I. firm based in Cambridge, England, that aims to automate decision-making in areas such as finance and logistics, has appointed Gary Brotman as vice president of product and marketing. Brotman was previously at Qualcomm Technologies, where he served as head of A.I. strategy and product planning.

EYE ON A.I. RESEARCH

Lightning could strike. Researchers at the Swiss Federal Institute of Technology Lausanne created a machine learning model that can predict where and when lightning will strike from basic weather station data. But before all the power company executives out there get too excited, the algorithm is pretty crude: it could successfully predict a strike only to within a 30-kilometer radius and within half an hour of the actual strike, which is not accurate enough for most use cases.

Is that a T-shirt or an invisibility cloak? Researchers from Northeastern University, M.I.T., and IBM Research have collaborated on a project to create T-shirts that allow wearers to evade facial recognition systems. The T-shirts are printed with very specific patterns that can fool the algorithm underpinning the computer vision system. They prevent the algorithm from drawing a bounding box—the first step in most object or facial recognition systems—around the individual wearing the T-shirt.

FORTUNE ON A.I.

Workers Are Worried Robots Will Steal Their Jobs. Here’s How to Calm Their Fears — By Anne Fisher

How A.I. Can Ease the Pain of Booking Your Next Vacation — By Eamon Barrett

Collaborate or Isolate? The U.S. Tech World Is Watching China’s Advances in A.I.—Warily — By Naomi Xu Elegant

For Now, Autonomous Cars May Mean Never Having to Park Again— By Fortune Editors

BRAIN FOOD

What do we mean when we talk about intelligence? That's the question that François Chollet, a well-known A.I. researcher at Google (he is the original author of the Keras deep learning library), asks in a recent paper.

Chollet points to the disconnect between how A.I. researchers gauge progress and how non-A.I. folks think about intelligence:

The computer researchers, Chollet argues, tend to judge intelligence by how well a model performs on a specific skill-based test (for example, how well the system plays old Atari games, or answers questions about a specific text, or translates text from one language to another).
But most people outside of A.I., he says, tend to view intelligence in terms of how efficiently a person or system learns and how capable it is of applying knowledge across fields.

This disconnect is problematic, Chollet says. A.I. researchers throw ever more data and computing power at narrow problems without bothering too much about how efficiently their systems learn or how transferable their model's "intelligence" is.

Then, when these systems successfully master some complex task, it gives the public a false expectation that the software is able to perform other similarly complex tasks.

"As humans, we can only display high skill at a specific task if we have the ability to efficiently acquire skills in general ... No one is born knowing chess, or predisposed specifically for playing chess. Thus, if a human plays chess at a high level, we can safely assume that this person is intelligent, because we implicitly know that they had to use their general intelligence to acquire this specific skill over their lifetime, which reflects their general ability to acquire many other possible skills in the same way."

The same assumption, he says, doesn't work for machines.

Chollet argues that if we are ever going to move beyond narrow A.I. towards the Holy Grail of "artificial general intelligence," the field is going to need a benchmark that actually captures most people's understanding of intelligence. He proposes a new benchmark training and testing data set, which he calls the Abstraction and Reasoning Corpus (ARC). It consists of training and evaluation tasks similar to what you'd find on an IQ test.

Although the test is difficult, smart humans can solve the evaluation tasks; no current machine learning system can.
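
To give a flavor of what such a task looks like, here is a rough sketch. The layout (a handful of "train" input/output grid pairs plus a "test" input, with grids as lists of small integers) follows the public ARC task files as I understand them, but the task itself is invented, and the brute-force `solve` function and its three candidate transformations are my own illustration, not anything Chollet proposes.

```python
# A made-up task in an ARC-like layout: each grid is a list of rows of integers.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[0, 3, 3], [0, 0, 4]], "output": [[3, 3, 0], [4, 0, 0]]},
    ],
    "test": [{"input": [[5, 0, 0], [0, 6, 0]]}],
}


def flip_vertical(grid):
    """Reverse the order of the rows."""
    return [list(row) for row in grid[::-1]]


def flip_horizontal(grid):
    """Reverse each row, mirroring the grid left to right."""
    return [row[::-1] for row in grid]


def transpose(grid):
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]


# Hypothetical, hand-picked hypothesis space; real ARC tasks need far more.
CANDIDATES = {
    "flip_vertical": flip_vertical,
    "flip_horizontal": flip_horizontal,
    "transpose": transpose,
}


def solve(task):
    """Return the first rule consistent with every training pair, applied to the test inputs."""
    for name, rule in CANDIDATES.items():
        if all(rule(pair["input"]) == pair["output"] for pair in task["train"]):
            return name, [rule(item["input"]) for item in task["test"]]
    return None, []


rule_name, predictions = solve(task)
print(rule_name, predictions)  # flip_horizontal [[[0, 0, 5], [0, 6, 0]]]
```

The catch is that a person infers the rule from two examples without being handed a menu of candidates, and each ARC task is built around a different rule. A program like this one fails as soon as the rule falls outside its hand-written list.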

It isn't clear ARC will catch on as a benchmark. But Chollet is probably right about the chasm between how A.I. researchers and the general public perceive intelligence—and the problems that ensue.