Focus on managing your data science workspaces, not your data volumes

Mike Oglesby

We’ve all been there. The project works great as a proof of concept, but when it comes time to move to production, progress stalls. Challenges around data management and traceability become painful roadblocks. Unfortunately, this is an all-too-common problem in the world of enterprise AI. Although the emerging machine learning operations (MLOps) ecosystem offers many tools for iterative AI model training and deployment, most of those tools don’t streamline data management. And those that do handle data management are often complex and force data scientists to manage storage resources separately from their data science workspaces.

To address this gap, we’ve developed the NetApp® Data Science Toolkit for Kubernetes, which is included in the newly released version 1.2 of the NetApp Data Science Toolkit. This toolkit abstracts storage resources and Kubernetes workloads up to the data science workspace level. Best of all, these capabilities are packaged in a simple, easy-to-use interface that’s designed for data scientists and data engineers. Using the familiar form of a Python program, the toolkit enables data scientists and engineers to provision and destroy JupyterLab workspaces in just seconds. These workspaces can contain terabytes, or even petabytes, of storage capacity, allowing data scientists to store all of their training datasets directly in their project workspaces. Gone are the days of separately managing workspaces and data volumes.
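For example, provisioning a new workspace with terabytes of attached storage can be a one-liner from a Python session. The sketch below follows the general shape of the toolkit’s Python library interface, but treat the function and parameter names as illustrative; the GitHub repository is the authoritative reference.

```python
# Illustrative sketch only - function and parameter names are assumptions
# based on the toolkit's library interface; see the GitHub repo for exact usage.
from ntap_dsutil_k8s import createJupyterLab

# Provision a JupyterLab workspace backed by a 10 TiB NetApp volume.
# Under the hood, the toolkit creates the Kubernetes resources and Trident
# provisions the storage - no DevOps or storage admin tickets required.
workspace_url = createJupyterLab(
    workspaceName="project-alpha",
    workspaceSize="10Ti",
)
print("JupyterLab workspace available at:", workspace_url)
```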

All of the under-the-hood storage and Kubernetes operations, which would otherwise require help from both a DevOps engineer and a storage administrator, are executed automatically. These self-service capabilities can significantly speed up AI projects, removing time-consuming IT request-response cycles.

Clone workspaces in seconds

With the NetApp Data Science Toolkit for Kubernetes, a data scientist can almost instantaneously create a JupyterLab workspace that’s an exact copy of an existing workspace, even if that workspace contains terabytes or petabytes of data and notebooks. Data scientists can quickly create clones of JupyterLab workspaces that they can modify as needed, while preserving the original “gold-source” workspace. These operations are built on NetApp Trident, NetApp’s enterprise-class dynamic storage orchestrator for Kubernetes, and NetApp’s highly efficient, battle-tested cloning technology. And they can be performed directly by data scientists who don’t have storage or Kubernetes expertise. Operations that used to take days or weeks, plus the assistance of both a DevOps engineer and a storage administrator, now take a data scientist just seconds.
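Cloning follows the same simple pattern as provisioning. Again, the names below are illustrative, not authoritative:

```python
# Illustrative sketch - names are assumptions; see the GitHub repo.
from ntap_dsutil_k8s import cloneJupyterLab

# Create a near-instant, space-efficient clone of the gold-source workspace.
# The clone shares unchanged data blocks with its source, which is why even
# petabyte-scale workspaces can be copied in seconds.
clone_url = cloneJupyterLab(
    sourceWorkspaceName="project-alpha",
    newWorkspaceName="project-alpha-experiment-1",
)
print("Cloned workspace available at:", clone_url)
```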

Traceability made easy

Data scientists can also save space-efficient, read-only copies of existing JupyterLab workspaces. Based on Trident and NetApp Snapshot™ technology, this functionality can be used to version workspaces and implement workspace-to-model traceability. Best of all, since datasets can now be stored directly within workspaces, there is no need to implement dataset traceability separately. Dataset-to-model traceability is literally built into the workspace. In regulated industries, traceability is a baseline requirement, and implementing it is often extremely cumbersome. Now, with the Data Science Toolkit for Kubernetes, it’s amazingly easy.
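In practice, versioning a workspace alongside a model might look something like the sketch below. The names are illustrative again; naming the snapshot after the model version it corresponds to is one simple traceability convention.

```python
# Illustrative sketch - names are assumptions; see the GitHub repo.
from ntap_dsutil_k8s import createJupyterLabSnapshot

# Save a read-only, space-efficient, point-in-time copy of the workspace.
# Because the training data lives inside the workspace, this single snapshot
# captures the exact code and data used to produce the model.
createJupyterLabSnapshot(
    workspaceName="project-alpha",
    snapshotName="model-v1.0",
)
```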

Automate your workflows

You can also use the Data Science Toolkit for Kubernetes in conjunction with a workflow management platform, such as Apache Airflow or Kubeflow Pipelines, to automate various AI workflows. Do you have a workflow that involves provisioning or cloning a data scientist workspace? You can use the toolkit to automate the workspace provisioning or cloning step. Do you have a complicated compliance workflow that involves implementing traceability? No problem, you can automate that too.
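For instance, a workspace-cloning step can be wrapped in an Apache Airflow task. The sketch below assumes the toolkit’s Python library interface and a hypothetical DAG; it’s meant to show the shape of the integration, not a definitive implementation.

```python
# Illustrative sketch of calling the toolkit from an Apache Airflow DAG.
# Toolkit function names are assumptions; see the GitHub repo for exact usage.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from ntap_dsutil_k8s import cloneJupyterLab

def clone_gold_source_workspace():
    # Clone the read-only gold-source workspace for this experiment run.
    cloneJupyterLab(
        sourceWorkspaceName="gold-source",
        newWorkspaceName="experiment-run",
    )

with DAG(
    dag_id="provision_experiment_workspace",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered by an upstream pipeline or manually
) as dag:
    PythonOperator(
        task_id="clone_workspace",
        python_callable=clone_gold_source_workspace,
    )
```

With the NetApp Data Science Toolkit, data scientist self-service really is possible. To learn more, visit the toolkit’s GitHub repository.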

Mike Oglesby

Mike is a Technical Marketing Engineer at NetApp focused on MLOps and data pipeline solutions. He architects and validates full-stack AI/ML/DL data and experiment management solutions that span the hybrid cloud. Mike has a DevOps background and strong knowledge of DevOps processes and tools. Prior to joining NetApp, Mike worked on a line-of-business application development team at a large global financial services company. Outside of work, Mike loves to travel. One of his passions is experiencing other places and cultures through their food.
