Given the rampant chatter in Silicon Valley about companies moving software and data from their data centers to a shared public cloud, you might expect they considered what that would mean for their critical business software applications.
But you would be wrong, according to one expert who helps corporations and government agencies make this journey.
“When companies move to the cloud, they’re often not doing a basic engineering task which is to look at all their applications and figure out what happens if they fail,” Vishwas Lele, chief technology officer at consulting firm Applied Information Sciences, tells Fortune.
Most big companies are aware of basic best practices for cloud computing. For example, they know they should distribute applications and data across different data centers within one location—or “availability zone” in Amazon Web Services parlance. That means if there’s a power failure in one facility, the application will keep running from another. This applies across different geographic regions as well, say from data centers between Oregon and Northern Virginia.
But because these massive data centers rely on tens of thousands of servers, so some sort of hardware failure is inevitable. The key is to figure out which of a company’s applications are most critical to its ongoing operations and “build a reliability bubble” around those applications, Lele says.
Unfortunately, we cannot eliminate such cloud failures. So what can we do to protect our apps from failures? The answer is to conduct a systematic analysis of the different failure modes, and have a recovery action for each failure type.
“You look at each application, see what can go wrong, and create a check list,” Lele tells Fortune now. “Then you look at the potential for failure, and figure out how to mitigate it for the most important applications.”
IT professionals can think ahead to avoid a small cloud glitch from spiraling. They can, for example, add what Lele calls “retry logic” to an important application. If the application hits a snag, this logic triggers an automatic retry. It’s similar to hitting reboot on your computer before calling the help desk. This logic means that if a resource is temporarily unavailable to an application, the application will know enough to retry in a few seconds instead of shutting down. This, he continues, is something that IT pros don’t have to do when running applications in-house, where they manage all the systems.
Tim Crawford, a strategic advisor with AVOA, a Los Angeles-based consultancy, agreed, noting that the problem is not that public cloud services fail but that many people mistakenly think that they won’t fail.
“Many enterprises assume that cloud infrastructure is resilient in away that’s similar to their corporate data centers,” he noted. However, in-house data centers are built with resiliency at many levels whereas public cloud infrastructure is not.
Thus cloud applications need to be smarter and more aware of what’s going on around them than they had to be when running in corporate server rooms. Hence the need to add things like retry logic to the mix.
On the other hand, some industry watchers argue that companies shouldn’t overthink any cloud move, but just do it.
“I think companies do enough due diligence moving to cloud—sometimes even too much if they hesitate too long,” says Holger Mueller, an analyst with Constellation Research. The real issue, in his view, is that once companies are already using a public cloud, they do not necessarily build and deploy their applications in a way that can adapt to unexpected outages.
The lesson from the recent AWS outage, he says, is that many new “born to the web” businesses do not write code to ensure high-availability and disaster recovery. “Even Amazon’s own dashboard didn’t work in this case,” Mueller says. “There is too much talk before moving to the cloud and too little work once they’re already in the cloud to take advantage of the cloud’s capabilities and requirements.”
He also points to the failure of the DYN’s domain name system (DNS) last October, which took down many popular web sites, including Twitter (TWTR), Etsy (ETSY), Spotify, GitHub, SoundCloud, and Salesforce-owned Heroku (CRM). (A DNS serves as a directory for Internet sites and converts human-recognizable Internet domain names—like Twitter.com—into numbers readable by the computers running the Internet. If the DNS system fails, sites can become inaccessible.)
Get Data Sheet, Fortune’s technology newsletter.
The lesson: “You can’t rely on a single DNS provider or you have a single point of failure,” Mueller says. Avoiding single points of failure is cloud computing lesson 101.
There has long been a difference in perspectives between older companies moving key existing applications—think an insurance company’s underwriting software—to the cloud and “newbies” that were born there.
Holger’s overriding point is that big, older companies that are moving stuff have to work carefully take advantage of cloud but even born-to-the-cloud companies can screw things up.
And both groups need to give lots of thought to how their applications can make best use of a massive shared infrastructure and mitigate the risk of hardware or other failure—because it will happen.