Why teaching A.I. to read is a lifelong endeavor

October 27, 2020, 4:34 PM UTC

Tech giants aren’t the only companies using artificial intelligence to understand human language so that products like digital assistants can respond to basic questions.

More conventional businesses are also increasingly using a subset of A.I. called natural language processing (NLP) to build more powerful software that answers basic customer call-center queries or summarizes long, complicated documents.

LexisNexis, for instance, has been using NLP to improve the legal research software that lawyers, journalists, and analysts use to find relevant court documents. It’s light-years ahead of the user-unfriendly Boolean search system that I regularly used over a decade ago as a cub reporter.

With A.I., LexisNexis’ search interface is more intuitive. That’s partly because the company used Google’s free, open-source language model BERT as the foundation. The BERT model, trained on a vast amount of web data including Wikipedia pages, helps software better understand how some words mean different things depending on the context in which they appear.  

But LexisNexis can’t use BERT for all of its language needs because the company deals with information that is specific to the legal industry. This particular data can’t be found on the open web, which means the information doesn’t come baked into BERT.

Min Chen, vice president and chief technology officer for the LexisNexis Asia-Pacific and global search team, said that BERT “provides a good base model to start with.” But the company must fine-tune the technology with additional legal data so that it understands legal language.

This fine-tuning is increasingly common for many companies operating in areas like finance or healthcare. Every industry has its own lingo that makes no sense in another context.

Chen said it took LexisNexis 12 months to train a version of BERT that understands case citations and even Latin. If someone wants to find a document showing that a case has been adjudicated, or closed, the technology knows to look for documents with the Latin term res judicata (claim preclusion, or a matter decided). 
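As a toy illustration of the kind of domain awareness Chen describes, a legal search layer might expand a plain-English query with the Latin terms of art that lawyers actually use in filings. The vocabulary and function below are hypothetical examples, not LexisNexis code; a production system would rely on a fine-tuned model rather than a hand-built table.

```python
# Toy sketch: expanding a plain-English legal query with Latin terms of art.
# The synonym table is illustrative only.
LEGAL_SYNONYMS = {
    "claim preclusion": ["res judicata"],
    "matter decided": ["res judicata"],
    "guilty mind": ["mens rea"],
    "friend of the court": ["amicus curiae"],
}

def expand_query(query: str) -> list[str]:
    """Return the original query plus Latin equivalents for any phrases it contains."""
    terms = [query]
    lowered = query.lower()
    for phrase, latin in LEGAL_SYNONYMS.items():
        if phrase in lowered:
            terms.extend(latin)
    return terms

print(expand_query("cases showing claim preclusion"))
# → ['cases showing claim preclusion', 'res judicata']
```

A fine-tuned model like LexisNexis’ learns these associations from legal text instead of a fixed lookup, which is what makes the 12 months of training worthwhile.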

As Amanda Stent, an NLP expert for financial news and information service Bloomberg, explained, technologies like BERT are important because they remove a lot of the grunt work required to train a language model from scratch. For a 10-word sentence, Stent said, “the combinations [of words] are astronomical,” and having a powerful language model like BERT as a starting point is very helpful.

But as other A.I. researchers have pointed out, because language models are typically trained on Internet data, they sometimes parrot back the offensive text they’ve scanned. You’ll be happy to know that companies can take precautions to make this less likely.

Stent and her colleagues recently published a set of best practices that companies can follow when training A.I.-powered language models and other machine learning systems. They recommended using human subject-matter experts to help annotate and label the text used for training (to ensure data is labeled accurately) and having product managers and engineers coordinate on big projects (to help ensure that problems don’t slip through the cracks).

The goal is to eliminate any problems before companies introduce new products. After all, no user wants to be bombarded with vile language.  

One thing companies should be prepared for is that data training projects are never done. There’s always room for improvement. 

Said Stent, “It never stops.”

Jonathan Vanian 


Speaking of NLP. Eye on A.I.’s Jeremy Kahn takes a look at AI21 Labs, an NLP-focused startup founded by prominent machine learning researchers that aims “to fundamentally transform how we read and write.” As opposed to other language models like OpenAI’s GPT-3, Kahn writes that the startup’s “system is a fusion between neural network-based language models and an older form of artificial intelligence that seeks to represent human knowledge, like vocabulary and the meaning of words, in a graph structure.”

Enter the A.I. Threat Matrix. The nonprofit, security-focused MITRE Corporation, Microsoft, IBM, Nvidia, Bosch, and a host of other companies teamed up to release the Adversarial ML Threat Matrix, which VentureBeat described as “an industry-focused open framework designed to help security analysts to detect, respond to, and remediate threats against machine learning systems.” The goal is to help companies better secure their machine learning systems by thoroughly understanding all of the ways hackers can crack modern A.I. software. The authors of the threat matrix said via GitHub, “Data can be weaponized in new ways which requires an extension of how we model cyber adversary behavior, to reflect emerging threat vectors and the rapidly evolving adversarial machine learning attack lifecycle.”

How to bring “dead languages” back to life. MIT researchers are using machine learning to “automatically decipher lost languages that can no longer be understood,” technology publication CNET reported. The researchers created an algorithm that analyzes the patterns of how languages develop over time to help uncover the forgotten languages. From the report: “Going forward, the team hopes to expand its work to identify the semantic meaning of words, even if they're not readable yet. It ultimately hopes to be able to resurrect lost languages using just a few thousand words.”

The FDA sounds the A.I. bias alarm. Bakul Patel, the director of the U.S. Food and Drug Administration’s new Digital Health Center of Excellence, explained during an online meeting how biased and unclean data could cause machine learning software to misfire and “negatively impact patient care,” industry publication MedTech Dive reported. “We don’t want to set up a system and we would not want to figure out after the product is out in the market that it is missing a certain type of population or demographic or other aspects that we would have accidentally not realized,” Patel said.


Censia has picked Deborah Leff to join the enterprise software startup’s board. Leff was previously the global leader and industry chief technology officer for data science and A.I. at IBM.

Nautilus hired Garry Wiseman to be the fitness company’s senior vice president and chief digital officer. Wiseman was previously the senior vice president of digital customer experience for Dell Technologies.


When auditing A.I. research, look at the conferences. Technology analysis website TechTalks looks into a recent research paper describing the review process that researchers face when submitting their papers to the International Conference on Learning Representations. The authors of the research paper, who are currently anonymous, claim that they have found some problems with the submission process, including “evidence for a gender gap, with female authors receiving lower scores, lower acceptance rates, and fewer citations per paper than their male counterparts.”

As TechTalks notes, the research paper documents several instances of bias, including the conference organizers showing “significant preference for Carnegie Mellon, MIT, and Cornell universities.” Researchers who published their papers on the popular arXiv preprint server prior to submission also did better, especially if they came from those top-tier universities.

From TechTalks:

Interestingly, their research did not find a significant bias toward large tech companies such as Google, Facebook, and Microsoft, which house reputable AI researchers. At first glance, this is a positive finding, because big tech already has a vast influence over commercial AI and, by extension, on AI research.

But as other authors have pointed out, the same academic institutions that are very well represented at AI conferences serve as talent pools for big tech companies and receive much of their funding from those same organizations. So this just creates a feedback loop of a narrow group of people promoting each other’s work and hiring each other at the expense of others.


Former Facebook employee’s new book exposes Big Tech’s dirty secrets—By Danielle Abril

Startup cofounded by A.I. heavy hitters debuts editing tool it hopes will ‘transform writing’—By Jeremy Kahn

Is it time for a new agency to oversee Big Tech? Many say yes—By Jeff John Roberts

Here’s what Amazon’s new Echo speakers are like—By Jonathan Vanian

How Lyft became the company with nine lives—By Beth Kowitt

A possible semiconductor shortage looms over Huawei’s new smartphone launch—By Naomi Xu Elegant


A.I. takes to space. Researchers have found machine learning technology to be an excellent tool for analyzing space data. In 2017, for instance, NASA and Google used neural networks to comb through imagery data captured from the Kepler space telescope, and uncovered a couple of planets far outside of our solar system. More recently, researchers from NASA’s Jet Propulsion Laboratory have used machine learning to identify recently formed craters on the surface of Mars. Space.com reports:

Scientists have fed the algorithm more than 112,000 images taken by the Context Camera on NASA's Mars Reconnaissance Orbiter (MRO). The program is designed to scan the photos for changes to Martian surface features that are indicative of new craters. In the case of the algorithm's first batch of finds, scientists think these craters formed from a meteor impact between March 2010 and May 2012. 

