Do Data Scientists Really Need to Know How to Operate Public Cloud?

The case for cloud computing for data scientists has been made pretty clearly over the past few years. Compared with traditional on-premises solutions, it provides excellent scalability, simplifies management, and reduces costs.

A recent survey, which analyzed the answers of almost 24,000 data professionals, revealed that 69% of data scientists have used at least one cloud computing product in the last 5 years. The study also revealed that the most popular cloud computing products include AWS Elastic Compute Cloud (EC2), Google Compute Engine, and AWS Lambda.

But the cloud is complex to use efficiently. It is daunting and challenging even for specialized cloud and DevOps professionals.

Of course, you could try to learn cloud best practices yourself. “How hard can it be?” a data scientist might ask themselves.

To be frank, it would take you several years to read all the available documentation for just two public clouds. So it’s probably not the best option, because you are looking to deliver your data science solutions quickly.

On the other hand, if you rely on a team of sysadmins or ops engineers to set up the cloud environments your applications need, you will agree that everyone’s life would be much easier without the lengthy approval cycles and conflicts of interest between the two teams. Data scientists have a great need for on-demand access to the infrastructure they require, without needing know-how of the cloud and its many tools and services.

A study by Alex Hanna, a computational social scientist working on the ML curriculum at Google, developed three profiles of data scientists and their daily challenges. The profiles are meant to be used as guidelines for better serving data scientists’ needs in Google’s training resources for GCP’s AI products.

In her study, she says that data scientists typically have a set of blockers, uncertainties, or topics on which they’re unsure. When adopting a new technology, their questions may range from “Can I do this particular task on the public cloud?” to “Where do I even begin learning about this?”.

One of the three personas she built is the data engineer Sasha.

Sasha works for a startup in Austin, Texas, that develops personalized wardrobe recommendations. They are 27, and before joining the startup, they were a developer at a regional healthcare cooperative.

Competencies

Sasha is fluent in designing data pipelines and constructing scalable infrastructure for multiple types of users of the company’s systems. As part of a smaller team, they understand the needs of a large number of downstream stakeholders, including data analysts, data scientists, and ML engineers. As such, they are adept at designing systems for data acquisition and importing, integrating new data sources, and extract/transform/load (ETL) operations.

Goals

Sasha wants to migrate their startup’s current workload from Amazon Web Services to Google Cloud Platform. They have heard a lot of good things about the machine learning and AI tools available on GCP, and are now looking forward to learning some of the best practices for deploying those tools.

Challenges

But Sasha is worried about having to serve as both an infrastructure architect and a data engineer for their current company. They have an advantage because they work at a cloud-native company and have set up much of its current infrastructure, but they are somewhat overwhelmed by having to translate that to a new cloud.

Now comes the good part

We think Bunnyshell can help Sasha focus on their data science job, while technology takes care of migrating and maintaining the infrastructure. And if your challenges sound anything like Sasha’s, we think you will find Bunnyshell to be a pretty interesting tool.

Our goal is to make the cloud easy to use for everyone. We do that through translation and abstraction: we created an intuitive layer over the most popular clouds and services, like AWS, Azure, DigitalOcean, and Google Cloud. The layer automates all the infrastructure-related work based on best practices and then makes it available to the user with a few clicks. For example, it takes a few clicks to create servers or clusters and provision them with software like Hadoop, Cassandra, or Spark. You can deploy applications with one click in any cloud, and scale or destroy machines on demand, without ever leaving the interface.