
Privacy-preserving A.I. is the future of A.I.

June 16, 2020, 3:01 PM UTC

This is the web version of Eye on A.I., Fortune’s weekly newsletter covering artificial intelligence and business. To get it delivered weekly to your in-box, sign up here.

I spent part of last week listening to the panel discussions at CogX, the London “festival of A.I. and emerging technology” that takes place each June. This year, due to Covid-19, the event took place completely online. (For more about how CogX pulled that off, look here.)

There were tons of interesting talks on A.I.-related topics. If you didn’t catch any of it, I’d urge you to look up the agenda and try to find recordings of the sessions on YouTube.

One of the most interesting sessions I tuned into was on privacy-preserving machine learning. It’s becoming a hot topic, particularly in healthcare, where the coronavirus pandemic is accelerating interest in applying machine learning to patient records.

Currently, the solution to preserving patient privacy in most datasets used for healthcare A.I. is to anonymize the data: In other words, personal identifying information such as names, addresses, phone numbers, and social security numbers is simply stripped out of the dataset before it is fed to the A.I. algorithm. Anonymization is also the standard in other industries, especially those that are heavily regulated, such as finance and insurance.

But researchers have shown that this kind of anonymization doesn’t guarantee privacy: There are often other fields in the data, such as location, age, or occupation, that can be used to re-identify an individual, especially when they are cross-referenced with another dataset that does include personal information.

Privacy-preserving machine learning, by contrast, promises much more security—in fact, most methods offer mathematical certainty that individual records cannot be re-identified by the person training or running the A.I. algorithm. But it comes with trade-offs: some privacy-preserving methods are less accurate, and some require more computing power or take longer to run.

Last week, Eric Topol, the cardiologist who is both a huge believer in the potential for A.I. to transform healthcare and a notable skeptic of the hype so far about A.I. in healthcare, took to Twitter to highlight a paper published in Nature on the potential use of federated learning, a privacy-preserving machine learning technique, to build much larger and better-quality datasets of medical images for A.I. applications.

As the CogX panelists noted, the ability to draw insights from large datasets without compromising critical personal information is of potential interest far beyond healthcare: It could help industries create better benchmarks without compromising competitive information, or help companies serve their customers better without having to collect and store vast amounts of personal information about them.

Blaise Thomson, who is the founder and chief executive officer of Bitfount, a company creating software to enable this kind of insight-sharing between companies (and who sold his previous company to Apple), went so far as to say that privacy-preserving A.I. could strike a blow against monopolies. It could, he argued, help reverse A.I.’s tendency to reinforce winner-takes-all markets, where the largest company has access to more data, cementing its market leadership. (He didn’t mention any names, but ahem, Google, and, cough, Facebook.) Thomson is a fan of a privacy-preserving method called secure multi-party computation, a cryptographic technique that lets several parties jointly compute over their combined data without any of them revealing their raw data to the others.
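For a rough sense of how that works, here is a minimal sketch of additive secret sharing, one of the simplest building blocks of secure multi-party computation. The two companies, their numbers, and the three compute parties are hypothetical, and real systems layer far more cryptographic machinery on top of this idea.

```python
import random

MODULUS = 2**61 - 1  # a large prime; all arithmetic is done modulo this value

def share(secret, n_parties):
    """Split `secret` into n additive shares that sum to it mod MODULUS."""
    shares = [random.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    """Recombine shares to recover the underlying value."""
    return sum(shares) % MODULUS

# Two hypothetical companies each hold a private number (say, a revenue figure).
alice_secret, bob_secret = 1200, 3400
alice_shares = share(alice_secret, 3)
bob_shares = share(bob_secret, 3)

# Each of three compute parties adds the shares it holds, never seeing the originals.
summed_shares = [(a + b) % MODULUS for a, b in zip(alice_shares, bob_shares)]

# Only the combined result is ever revealed.
assert reconstruct(summed_shares) == alice_secret + bob_secret
print(reconstruct(summed_shares))  # 4600
```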

M.M. Hassan Mahmoud, the senior A.I. and machine-learning technologist at the U.K.’s Digital Catapult, a government-backed organization that helps startups, explained federated learning. It works as a network in which each node keeps all of its own data locally and uses that data to train a local A.I. model. The parameters of each local model, but never the underlying data, are shared with a central server, which aggregates them into a better, global model that is then pushed back down to the nodes.
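To make that loop concrete, here is a minimal sketch of federated averaging, the aggregation scheme most federated-learning systems implement in some form. The "model" is just the weight vector of a linear regression, and the three hospitals and their data are made up for the example; real platforms add secure aggregation, compression, and fault handling on top.

```python
import numpy as np

def local_update(global_weights, X, y, lr=0.1, epochs=5):
    """Each node trains on its own data, starting from the current global model."""
    w = global_weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w

def federated_average(local_weights, sizes):
    """The server combines local models, weighting each node by how much data it holds."""
    total = sum(sizes)
    return sum(w * (n / total) for w, n in zip(local_weights, sizes))

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Three hypothetical hospitals, each keeping its data locally.
node_data = []
for n in (50, 80, 120):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    node_data.append((X, y))

global_w = np.zeros(2)
for _ in range(20):
    # Only model weights travel to the server; raw records never leave a node.
    local_ws = [local_update(global_w, X, y) for X, y in node_data]
    global_w = federated_average(local_ws, [len(y) for _, y in node_data])

print(global_w)  # should approach [2.0, -1.0]
```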

The problem: Coordinating all that information sharing requires specialized software platforms, and right now the federated-learning systems from different vendors (Google has one, graphics chip giant Nvidia has one, and China’s WeBank has another) are not compatible with one another. So Mahmoud’s team built, as a proof of concept, a federated-learning system that can work across all of these platforms. “It’s a great time to, as a community, build a common, open, scalable core that can be trusted by everyone,” Mahmoud said.

The final panelist was Oliver Smith, the strategy director and head of ethics for Health Moonshot at Telefonica Innovation Alpha, a branch of the Spanish telecommunications firm Telefonica that works on transformative digital projects, including, in this case, mobile apps to support people’s mental health. Smith said his group had investigated six different techniques for implementing privacy-preserving A.I. “My dream that we could take one technology and apply it to all of our use cases is not really right,” he concluded. Instead, each use case was probably best suited to a different technique.

But Smith was clear about the potential of the whole field: “All of these techniques hold the promise of being able to mathematically prove privacy,” he said. “This is much better than anonymization and that is where we need to get to.”

It’s clearly a trend that anyone implementing an A.I. system—especially one that deals with personal information—ought to be thinking hard about.

With that, here’s the rest of this week’s A.I. news.

Jeremy Kahn
@jeremyakahn
jeremy.kahn@fortune.com

A.I. IN THE NEWS

Amazon and Microsoft join the move to stop selling facial recognition software to U.S. police—at least for now. Amazon said it would stop selling facial recognition software to U.S. police departments for at least a year. Microsoft announced it would not sell such systems to U.S. law enforcement "until we have a national law in place, grounded in human rights, that will govern this technology." The moves follow IBM's decision to exit the market for "general purpose" facial recognition technology. The protests for racial justice following the killing of George Floyd have brought renewed scrutiny to police use of facial recognition software—which research has shown is less accurate for darker-skinned people—and the ways in which it can exacerbate already biased law enforcement.

Microsoft's robot editor confuses two mixed-race Little Mix singers. Microsoft is in the process of replacing many of the human editors who select stories, choose images and write headlines for its MSN News service with A.I. software. But the rollout of the new A.I. editors hit a snag, The Guardian reported, when the software chose to illustrate a story about Little Mix singer Jade Thirlwall's reflections on racism with a photograph of fellow group member Leigh-Anne Pinnock, who is also mixed race. The mistake, which Microsoft says it corrected as soon as it became aware of it, highlights issues around potential bias in A.I. algorithms. (In another twist on how A.I.-driven systems can go wrong, Microsoft also had to intervene to stop its robot editor from featuring news coverage about its own mistake on MSN.com.)

OpenAI launches its first commercial product. The San Francisco A.I. research group launched its first commercial product since standing up a for-profit arm last year. The product is a software interface that lets customers use OpenAI's new GPT-3 natural language processing algorithm. The algorithm can generate long passages of coherent text in almost any style, answer factual questions, and perform a number of other language-based tasks. The first companies trialing the software include legal software company Casetext, Reddit, Middlebury College, and video-game maker Latitude. My Fortune Eye on A.I. colleague Jonathan Vanian has more here.

Idemia wins the contract for EU biometric ID program. The French security company won a European Union contract to run the algorithms that match biometric data, including facial imagery, when people present passports at the bloc's ports of entry and borders, technology site OneZero reported. The contract covers a program for about 400 million people who were born outside the EU but work inside the zone for non-EU-based companies. The contract makes Idemia, which already has contracts to process biometric identification for many parts of the U.S. government, among the world's largest custodians of facial and other biometric data, the site says.

U.K. farmers are testing fruit-picking robots amid Covid-19 picker shortage. With coronavirus-related travel restrictions making it harder for British farmers to bring in seasonal workers from Eastern Europe, some growers have joined a consortium to trial new robots that can harvest vegetables and even delicate fruit, such as strawberries, Bloomberg News reports.

Facial recognition was used to figure out who was watching outdoor advertising at the Rose Bowl. An A.I. startup from Philadelphia called VSBLTY surreptitiously scanned the faces of about 30,000 people who attended this year's Rose Bowl in Pasadena, California, back in January, according to a report in OneZero. The company's software tracked whether—and for how long—those attending the game looked at advertising displays on screens at the stadium and also tried to determine their age and gender. This information can be used to help the stadium's marketing teams sell advertising. The same system also tried to spot anyone carrying weapons or appearing on watch lists of "suspicious people," to help the venue with security.

Chinese researchers play a vital role in U.S. A.I. success. That's the conclusion of a New York Times investigation that examined the contribution Chinese-educated students make to American leadership in artificial intelligence. Many of these students come to the U.S. for graduate school, where they participate in cutting-edge research, and many then stay on in the U.S. to work for big U.S. technology companies and A.I. startups, the Times discovered. The story comes amid increasing calls among some China hawks, including several prominent Republican lawmakers, to bar Chinese nationals from studying computer science and other STEM subjects in the U.S.

Microsoft finds some machine learning servers were hijacked to mine cryptocurrency. The software giant said it had discovered that several server clusters running Kubeflow, an open-source machine-learning toolkit, had been infected with malware that turned the servers into zombie cryptocurrency-mining machines, according to a story in technology publication The Register. Kubeflow is popular because it lets developers run Google's machine-learning framework, TensorFlow, on top of Kubernetes, the open-source platform for deploying and managing containerized applications. Machine learning workloads make attractive targets for crypto-mining hackers, The Register noted, because they frequently tap a lot of computing power, including graphics processing units and other accelerators—which are pretty useful for crypto miners too.

Facebook touts the use of its object detection algorithms in a mineral-mining application. The company said in a blog post that two Australian companies took its open-source Detectron2 algorithm, a computer vision system that can be trained to recognize various kinds of objects in images, and used it to create a system for analyzing drill core samples. The system, called DataRock, estimates rock strength, fracture risk, and other properties essential to the mining and energy sectors. It was created by DiUS, a technology company, and Solve Geosolutions, a data science consultancy that caters to the mining sector.

EYE ON A.I. TALENT

Sharktower AI, a company in Edinburgh, Scotland, that makes A.I.-enabled project management software, has appointed Brendan Waters as chief financial officer, according to the trade publication Business Cloud. Waters previously served as CFO of Wilson Design Source Supply and, earlier in his career, was CFO at gaming company FanDuel.

Signal AI, a technology firm in the U.K., has appointed Georgie Weedon as head of communications, according to PRWeek. She had previously worked as a communications strategist for the United Nations’ Intergovernmental Panel on Climate Change.

EYE ON A.I. RESEARCH

Facebook reveals winners of its contest to detect deepfakes. The social media company ran a competition for machine learning experts to create software able to automatically detect deepfakes, the pernicious fake videos created with clever machine learning algorithms, which some fear will soon be used for political disinformation. The competition showed just how difficult the problem is: the winning algorithm detected only about 65% of the real deepfakes it encountered. Nonetheless, Facebook said such a system might help the platform deter less sophisticated bad actors and could help lessen the workload on its human content reviewers. I reported on the results for Fortune here.

FORTUNE ON A.I.

George Floyd protests, coronavirus face masks pose challenges for facial recognition—by Jeremy Kahn

Snap wants machine learning experts to make more animated messages—by Jonathan Vanian

Microsoft follows IBM and Amazon in barring police from using its facial-recognition technology—by Jonathan Vanian

Buzzy research lab OpenAI debuts first product as it tries to live up to the hype—by Jonathan Vanian

BRAIN FOOD

Facebook thinks language models can be used for fact-checking. Researchers at Hong Kong University and Facebook AI Research have proposed, in a paper posted to the research repository arxiv.org, that state-of-the-art language models, which are frequently pre-trained on vast datasets, will suck up enough factual knowledge during this training that they could reliably serve as fact-checkers for online content such as social media posts.

They took a factual claim—such as "Thomas Jefferson founded the University of Virginia"—and masked, or hid, a key word in that claim, like the word Virginia. Then they asked the algorithm for its best prediction for the missing word. If the prediction matched the missing word, the system marked the claim as true. If it didn't, it judged the claim false.
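As a rough sketch of that setup (not the authors' actual code), here is how you might probe a pre-trained masked language model this way using the Hugging Face transformers library; the claim, the choice of BERT checkpoint, and the simple string match used for the verdict are all illustrative simplifications.

```python
from transformers import pipeline

# Load a pre-trained masked language model (BERT).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

claim = "Thomas Jefferson founded the University of Virginia"
masked_word = "virginia"  # the key word we hide from the model
masked_claim = "Thomas Jefferson founded the University of [MASK]."

# Ask the model for its top predictions for the hidden word.
predictions = fill_mask(masked_claim)
top_tokens = [p["token_str"].strip() for p in predictions]

# Crude verdict: the claim counts as "true" if the masked word is among the top predictions.
verdict = masked_word in top_tokens
print(top_tokens, verdict)
```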

The system is much simpler than other fact-checking software that has been developed, which has tended to rely on separate components for finding the best source of information on a subject and then retrieving evidence from that source to back up or refute a claim.

In tests, the language model approach performed OK—about as well as a simple evidence-retrieval system—but not great. The best accuracy the researchers obtained on a test set was 57%, which is far behind the best state-of-the-art fact-checking system built the traditional way, with about 77% accuracy.

Even larger language models—such as OpenAI's new GPT-3, which is more than 500 times bigger than BERT, the model the researchers used—might yield slightly better results. In fact, OpenAI seems to suggest that this kind of factual question answering might be a good commercial use for GPT-3.

But I think there's an inherent deficiency with this approach compared to the traditional ones that rely on a trusted knowledge base (like the Encyclopedia Britannica). These massive language models are pre-trained on text from all over the place. They ingest the complete works of Shakespeare, but they also gorge on threads from Reddit and other corners of the Internet not known for being particularly scrupulous with the truth. As a result, they soak up a lot of rubbish.

Ask GPT-2, OpenAI's previous state-of-the-art language model, to complete the sentence, "Lee Harvey Oswald was killed by..." and the algorithm generates the text, "a hit team of CIA operatives on November 22, 1963 in Dallas." Prompt it with, "The link between man-made carbon dioxide emissions and climate change is..." and it completes the sentence with "...still very much debated, and the science is murky on the point."
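If you want to try this kind of probing yourself, a minimal sketch using the publicly released GPT-2 model and the Hugging Face transformers library looks like the following; the completions are sampled, so what you get will differ from run to run (and from the examples quoted above).

```python
from transformers import pipeline

# Load the publicly released GPT-2 model for text generation.
generator = pipeline("text-generation", model="gpt2")

prompts = [
    "Lee Harvey Oswald was killed by",
    "The link between man-made carbon dioxide emissions and climate change is",
]

for prompt in prompts:
    # Sample a continuation; outputs are stochastic and often factually wrong.
    out = generator(prompt, max_length=40, num_return_sequences=1, do_sample=True)
    print(out[0]["generated_text"])
```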

Given that automatic fact-checking tools are being touted as a defense against rampant disinformation on social media, these do not seem like auspicious results. Worse, if you knew large pre-trained language models were going to be used in this way, you could easily fool future systems by posting deliberate fictions to Reddit or elsewhere on the Internet. The language-based models would ingest these fictions during pre-training and then, when they encountered the claims again, judge them to be true. But, of course, no one would ever want to do such a thing, would they? (Here's looking at you, Vladimir.)