Amazon’s cloud nightmare

April 22, 2011, 2:24 PM UTC

By Dan Mitchell, contributor

FORTUNE — The snafu at Amazon’s EC2 hosting service on Thursday, which knocked several big web sites out of service, is being called a “black eye” for the cloud-computing business — a “we told you so” moment, according to cloud critics. But it could simply be a black eye for EC2.

It seems unlikely that this incident will cause startups to turn away from cloud computing, which for smaller companies is much cheaper than self-hosting. More likely, some of them will think twice about hosting with EC2, one of the industry leaders. That’s because this was a particularly nasty, widespread, and long-lasting outage. A whole bunch of sites were thrown totally or partially out of commission for most of the day Thursday, including Quora, Foursquare, Reddit, and HootSuite. (Update: some sites are still down as of midday Friday.)

Technical glitches are bound to happen, of course, whether they’re in the cloud or in an expensive, staff-managed corporate server room. Sites go down all the time. But most often, the outages are brief and isolated. Apparently, this one jumped across various parts of Amazon’s cloud like lightning – in a way that Amazon had vowed it never could thanks to its “Availability Zones.”

“Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones,” Amazon (AMZN) says. The zones “protect your applications from failure of a single location.”

“It would seem they don’t exactly work as they’re designed,” notes The Register’s Cade Metz, who goes on to explain how Amazon’s cloud is (or is supposed to be) set up to avoid such failures. The company isn’t talking yet, but promises to publish a “postmortem” on the situation.

Some customers and observers said on Thursday that Amazon has a history of cloud trouble, though never anything this bad. Michael Hussey, chief executive of PeekYou, told Dow Jones that he’s been “seeing problems all year long” from Amazon’s Northern Virginia facility, where the failure occurred. The problems haven’t been severe, he said, but they have caused his company to consider moving its data to a different cloud provider.

PeekYou didn’t go down, because it also runs many of its own servers. Foursquare was knocked out entirely, but said in a blog post Thursday morning that Amazon is a “usually amazing” provider that was suffering “a few hiccups.”

The notion that this outage says anything about cloud computing’s utility as a whole seems far-fetched. “We don’t think the cloud is enterprise-ready,” Jimmy Tam, general manager of Peer Software, told The New York Times. His company provides data-backup services. “Are you really going to trust your corporate jewels to these cloud providers?” he asked.

Well, sure. Why not? Lots of companies have been doing it for years now (since before “cloud computing” became a popular term, as Oracle (ORCL) chief Larry Ellison has noted), with little trouble. And even this glitch, though nasty, didn’t put anybody’s “corporate jewels” at risk. Many companies, like PeekYou, use the cloud for routine data handling, and keep their more “mission critical” tasks in-house. It all comes down to quality and service — there’s nothing inherent in cloud computing that makes it less reliable than the alternative.

According to the research firm Gartner, the cloud market will grow to $102.1 billion next year, up from $68.3 billion last year. Startups and other companies will continue to weigh the benefits of cloud computing — mainly, massive cost savings and easy scalability — against the relatively small risk and annoyance of outages. Risk and annoyance, by the way, that they would likely have to face anyway.

The level of cost savings from moving to the cloud depends on all kinds of variables, of course — from the size of the enterprise to the nature of the data being handled. But just for example, a 2009 study by Booz Allen Hamilton, “The Economics of Cloud Computing,” found that “the benefit-to-cost ratio of a non-virtualized 1,000-server data center could reach 15.4:1 after implementation, and total life cycle cost may be 66% lower than maintaining a traditional data center.” In other words, Amazon and its cloud customers will learn from this outage, but those savings are simply too great to dismiss.

It’s unknown how much Foursquare, for example, has saved by operating from the cloud. But if the savings are typical, it’s probably worth the occasional annoyance of its users being temporarily unable to become mayor of their local coffee shop.
