Google does Hadoop and Spark as a service, with Cloud Dataproc

There comes a time when even the most ardent open source users might not want to manage their own servers and patch their own software. So goes the thinking behind Cloud Dataproc, Google’s (GOOG) new managed big-data service for running Hadoop and Spark as a service on Google’s cloud computing platform.

Hadoop and Spark are popular open source technologies for processing large amounts of data, but they are notoriously difficult to operate, especially in large deployments. Commercial technology vendors such as Cloudera and Hortonworks (HDP) are trying to solve this problem for users running these technologies in data centers, but the easiest option, for those willing to give up some control over their servers, is just to have a cloud provider take care of it for them.

Other cloud providers, including Amazon Web Services (AMZN) and Microsoft (MSFT), already offer managed services for Hadoop and Spark, so Google is not exactly blazing any new trails here. Where Google says it is doing something different is on cost: Cloud Dataproc costs just 1 cent per CPU per hour (billed by the minute), and can cost 50% to 70% less than comparable services depending on how much customers use it, Google Cloud Product Manager Greg DeMichillie told Fortune.
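
To put the quoted rate in concrete terms, here is a rough back-of-the-envelope sketch in Python. It assumes the 1-cent figure is the only charge being counted and ignores any minimum charge or underlying virtual machine costs, which are not covered here. The function names, the 40-CPU cluster, the 45-minute job and the round-up-to-the-hour comparison are hypothetical, chosen only to show why per-minute billing matters for short jobs.

```python
import math

# Rate quoted by Google: 1 cent per CPU per hour.
RATE_CENTS_PER_CPU_HOUR = 1


def fee_billed_by_minute(cpus: int, minutes: int) -> float:
    """Fee in cents when usage is metered by the minute."""
    return RATE_CENTS_PER_CPU_HOUR * cpus * minutes / 60


def fee_rounded_up_to_hour(cpus: int, minutes: int) -> int:
    """Hypothetical comparison: same rate, but every partial hour
    is billed as a full hour."""
    return RATE_CENTS_PER_CPU_HOUR * cpus * math.ceil(minutes / 60)


# Hypothetical job: a 40-CPU cluster that runs for 45 minutes.
print(fee_billed_by_minute(40, 45))    # 30.0 (cents)
print(fee_rounded_up_to_hour(40, 45))  # 40 (cents)
```

At these rates the absolute numbers are tiny, but the gap between the two billing schemes grows with cluster size and with the number of short jobs a customer runs.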

Google is also touting the integration of Cloud Dataproc with the company’s other cloud computing services for big data—including BigQuery, Cloud Storage and Cloud Bigtable (a database technology)—and the ability to work with Dataproc using standard interfaces. DeMichillie said Dataproc clusters take an average of about 90 seconds to come online, compared with at least several minutes if you’re deploying them on local servers, or even running open source Hadoop or Spark on cloud-provider virtual machines. Minutes—whether it’s 2 or 30—can make a big difference if you need those resources now, or if you’re being billed while machines are still spinning up.

Really, though, Google created Dataproc because customers wanted it and Google had a void in its cloud platform by not having it. Big data workloads are becoming more important with each passing day, especially as trends such as the Internet of Things provide a tangible, viable use case for years’ worth of talk about data analysis. If you’re a cloud provider and the only options you give users are to manage open source software on their own or to use your proprietary big data technology (Cloud Dataflow, in Google’s case), customers might start looking elsewhere.

Google might truly believe its proprietary Cloud Dataflow is the best way to manage and run data jobs (much like Microsoft might really believe the Prajna technology it’s building is superior), “but you do have to make some adjustments,” DeMichillie acknowledged.

“The thing we learned across all these things is there’s no one-size fits all,” he said. “… We definitely did talk to customers who were telling us [they would] really rather have a fully managed service [for open source technologies].”