Why your A.I. is only as good as your data

Companies investing in artificial intelligence should be aware that their systems will produce inaccurate results if fed low-quality data.

After all, machine learning constantly adapts based on the data it ingests. Companies must therefore continually pay attention to the data they collect as it changes over time. 

“It’s no different from building any system, except in this case, the systems are very dynamic,” said Ian Wong, the chief technology officer of real estate tech company Opendoor.

The data Opendoor collects includes public records, geographical data, and information sold or offered by third parties. The company uses this information to train machine-learning systems to forecast home prices so that it knows how much to bid on properties, how much renovations may cost, and how much to resell the homes for.

Last fall, a similar home forecasting algorithm used by online real estate listing company Zillow failed to quickly take into account the rapid changes in the real estate market during the COVID-19 pandemic. As a result, Zillow ended up having to write off over $500 million and shut down its own home-flipping business.

Wong declined to comment on Zillow, saying only, “Maybe the lesson we can take is that it’s hard,” referring to building A.I. for real estate.

Wong said Opendoor spends “a lot of internal energy and dollars and labor” to clean the data it uses for A.I. training. The company also routinely adds new data sources to improve the quality of its algorithms. For example, since the pandemic started, it has let home sellers submit video and photo inspections, which the company uses to improve its forecasting models, Wong said.

Still, as Zillow’s debacle showed, real estate prices can be difficult to predict, even for computers. 

Wong said Opendoor’s efforts to ensure the accuracy of its predictive systems partly involve training some of its machine-learning models “multiple times a day.” The company also created an executive committee that reviews what its A.I. systems produce every week.

He declined to elaborate about who is on the executive committee and what the committee’s discussions are like. Instead, he said the machine-learning review process is “part of the secret sauce,” though “no different from a business review.” 

Jonathan Vanian 


Come aboard. People interested in taking a ride in one of Cruise’s self-driving cars can now sign up for a waiting list, the company announced on Tuesday. Cruise did not say when the rides will start, but said they would be free to those who sign up. “We’re starting with a small number of users and will ramp up as we make more cars available, so sign up now to lock in an early spot,” Cruise co-founder Kyle Vogt wrote in a blog post.

Let’s keep this a secret. Waymo sued California's Department of Motor Vehicles to keep information that it deems “trade secrets” hidden from the public and competitors, according to a report by The Los Angeles Times. The information involves how Waymo vehicles handle emergencies, what the cars would do if they were driving in places they were not supposed to go, and descriptions of the autonomous vehicles’ crashes. “That’s among the information the DMV requires to determine whether to issue permits to deploy robot vehicles on public roads,” the report said.

Giving the robots another go. Walmart is partnering with the startup Brain Corp. to install robotic floor cleaners to help clean the warehouses of the company’s Sam’s Club stores, TheStreet reported. In November 2020, Walmart ended a contract with the startup Bossa Nova Robotics to use that company’s inventory-scanning robots at its stores.

Not worth sharing. The Crisis Text Line non-profit said it would stop sharing conversation data with the for-profit startup Loris.ai, which specializes in machine learning for analyzing conversations. The decision to end the data sharing comes after Politico reported on the practice, which alarmed several ethics and privacy experts, considering that people typically contact the Crisis Text Line while distressed and may not have understood that their conversations would be shared.


Firebolt hired Mosha Pasumansky to be the data analytics startup’s chief technology officer. Pasumansky was previously a principal engineer at Google, where he worked on the search engine’s BigQuery data technology team.

Wayfair picked Fiona Tan to be the e-commerce company’s CTO. Tan joined Wayfair in 2020 as global head of the company’s customer and supplier technology team. She was previously the head of technology for Walmart’s U.S. business division.


Evaluating the text. A.I. research company OpenAI released a paper detailing a new technique involving a deep-learning task called embeddings. In the A.I. subset of natural language processing, in which computers learn to understand language, embeddings are useful for discovering relationships between words and phrases. Embeddings can help A.I. systems better associate the phrase “canine companions say” with the word “woof” rather than “meow,” OpenAI said in a blog post about the paper.

However, A.I. researcher Nils Reimers, of the machine-learning startup Hugging Face, evaluated OpenAI’s paper and related software tools and claimed that older A.I. embedding tools outperform OpenAI’s software and are cheaper to implement.

From Reimers’ post:

The English Wikipedia had in 2020 around 6 million articles with about 2 billion tokens. When broken down into paragraphs of 100 tokens each, this yields 21M paragraphs.

Using the OpenAI Davinci model, it would cost us over $1 million to encode all English Wikipedia articles. In contrast, SpladeV2 is based on a distilbert-base model, which can encode about 300 paragraphs per second on a T4-GPU. Using a preemptive T4 GPU on Google Cloud, we have costs of $0.13 per hour (as of 27.01.2022). Hence, encoding Wikipedia with SpladeV2 might cost as little as $2.50.
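The $2.50 estimate can be checked with a few lines of arithmetic. The sketch below is not from Reimers' post. It simply plugs in the figures quoted above (21 million paragraphs, 300 paragraphs per second, $0.13 per GPU hour), and the function and variable names are illustrative assumptions. It also assumes a single GPU running at the quoted throughput with no overhead.

```python
# Figures quoted in Reimers' post; names are illustrative, not from the post.
PARAGRAPHS = 21_000_000        # English Wikipedia split into 100-token paragraphs
PARAGRAPHS_PER_SECOND = 300    # quoted distilbert-base throughput on a T4 GPU
PRICE_PER_GPU_HOUR = 0.13      # quoted Google Cloud T4 price, in U.S. dollars


def encoding_cost(paragraphs, paragraphs_per_second, price_per_hour):
    """Return (GPU hours, dollar cost) for encoding every paragraph once."""
    hours = paragraphs / paragraphs_per_second / 3600
    return hours, hours * price_per_hour


gpu_hours, cost = encoding_cost(
    PARAGRAPHS, PARAGRAPHS_PER_SECOND, PRICE_PER_GPU_HOUR
)
print(f"{gpu_hours:.1f} GPU hours, about ${cost:.2f}")
```

Under those assumptions the job takes a little under 20 GPU hours, which is consistent with the roughly $2.50 figure in the post.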


Inside Neuralink, Elon Musk’s mysterious brain chip startup: A culture of blame, impossible deadlines, and a missing CEO—By Jeremy Kahn and Jonathan Vanian

Toward data dignity: How we lost our privacy to Big Tech—By Tom Chavez, Maritza Johnson, and Jesper Andersen

Robot security guards are starting to patrol the nation’s theme parks—By Chris Morris

China’s Big Tech leads in VR and AR patent applications. But China’s metaverse may look different from everyone else’s—By Yvonne Lau


The limits of Watson Health. It wasn’t too long ago that IBM was pitching its Watson Health data analytics business as the future of healthcare. In January, however, IBM ended up selling the data and analytics assets of Watson Health to private equity firm Francisco Partners for an undisclosed amount. The deal underscored IBM’s troubles building a healthcare-specific A.I. business unit, despite spending roughly $5 billion acquiring healthcare companies like Truven Health Analytics to augment Watson Health.

In an interview with Slate, Casey Ross, a journalist for Stat News who has spent years chronicling Watson Health, describes some of the issues IBM experienced. Despite IBM’s stumbles, Ross still believes A.I. will play a big role in healthcare.

From the article: This will be a case study for business schools for decades. When you look at what IBM did and the strategy mistakes, the tactical errors that they made in pursuing this product, they made a lot of unforced errors here. It’s also true that the generation of technology that they had was nowhere near ready to accomplish the things that they set out to accomplish and promised that they could accomplish. I don’t think that the failure of Watson means that artificial intelligence isn’t ready to make significant improvements and changes in health care. I think it means the way that they approached it is a cautionary tale that lays out how not to do it.

