
Using XCP to Move Data from a Data Lake and High-Performance Computing to ONTAP NFS

Karthikeyan Nagalingam

Learn how to move your data by using the NetApp® XCP migration tool for NFS and non-NFS source data. This blog describes the XCP solution architecture for moving data from a data lake (Hadoop Distributed File System or MapR-FS) and high-performance computing (GPFS) to NetApp ONTAP® NFS for artificial intelligence (AI) and other cutting-edge technologies. Two customer use cases illustrate the solution architecture.

NetApp data mover solution for AI and HPC using XCP: data migration main flow with MapR-FS.

Use Case 1: Data Lake to NetApp ONTAP NFS

This use case is based on the largest financial customer proof of concept (CPOC) that we have done. In the past, we used the NetApp In-Place Analytics Module (NIPAM) to move analytics data to NetApp ONTAP AI. Because of recent enhancements, XCP's strong performance, and NetApp's unique data mover solution approach, we reran the data migration using the NetApp XCP migration tool.

Customer Challenges and Requirements

  • The customer has different types of data in a data lake, including structured, unstructured, and semi-structured data, logs, and machine-to-machine data. AI systems need to process all of these data types for prediction operations. When the data sits in a data lake's native file system, it is difficult to process.
  • The customer's AI architecture cannot access data from the Hadoop Distributed File System (HDFS) or a Hadoop Compatible File System (HCFS), so the data is not available to AI operations. AI requires data in a widely understood file system format such as NFS.
  • Special processes are needed to move data out of the data lake. Because the amount of data is huge, a high-throughput, cost-effective method is required to move it to the AI system.

Data Mover Solution

In this solution, the MapR File System (MapR-FS) is created from local disks in the MapR cluster. The MapR NFS Gateway is configured on each data node with virtual IPs. The file server service stores and manages the MapR-FS data, and the NFS Gateway makes MapR-FS data accessible to NFS clients through the virtual IPs. An XCP instance runs on each MapR data node to transfer data from the MapR NFS Gateway to NetApp ONTAP NFS, with each XCP instance transferring a specific set of source folders to the destination, as sketched below.
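The following minimal sketch illustrates that distribution scheme. The gateway virtual IPs, export paths, and folder names are assumptions for illustration, not the exact customer configuration; each generated command would normally run on its own data node.

```python
#!/usr/bin/env python3
"""Sketch: split top-level MapR-FS folders across NFS Gateway virtual IPs,
one 'xcp copy' per folder. All hosts and paths are illustrative assumptions."""

# Virtual IPs of the MapR NFS Gateways and the ONTAP NFS data LIF (assumed).
MAPR_NFS_VIPS = ["10.0.0.11", "10.0.0.12", "10.0.0.13", "10.0.0.14"]
ONTAP_NFS_LIF = "10.0.1.21"

# Top-level folders under the MapR-FS export, assigned round-robin to gateways.
SOURCE_FOLDERS = ["hive", "logs", "telemetry", "feeds", "models", "staging"]

def copy_commands():
    """Yield one 'xcp copy <src> <dst>' command per source folder."""
    for i, folder in enumerate(SOURCE_FOLDERS):
        vip = MAPR_NFS_VIPS[i % len(MAPR_NFS_VIPS)]
        src = f"{vip}:/mapr/cluster1/{folder}"        # MapR NFS Gateway export (assumed path)
        dst = f"{ONTAP_NFS_LIF}:/datalake/{folder}"   # ONTAP NFS export (assumed path)
        yield ["xcp", "copy", src, dst]

if __name__ == "__main__":
    # Print the per-node commands so the folder-to-gateway distribution is visible.
    for cmd in copy_commands():
        print(" ".join(cmd))
```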

NetApp data mover solution for MapR cluster using XCP.

To migrate a single source NFS folder from a MapR or Hadoop cluster to NetApp ONTAP NFS, first install the XCP software on a separate server with a high-end configuration. XCP is resource intensive, and a dedicated host isolates XCP's I/O-bound operations from the MapR cluster workloads.
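A minimal sketch of such a single-folder run from the dedicated XCP host follows. The source and destination exports are assumptions, and the verify pass, which re-reads both sides to compare them, is optional.

```python
#!/usr/bin/env python3
"""Sketch: one-folder migration driven from a dedicated XCP host.
The source and destination exports are illustrative assumptions."""
import subprocess

SRC = "mapr-nfs-gw:/mapr/cluster1/warehouse"   # MapR/Hadoop NFS Gateway export (assumed)
DST = "ontap-lif:/datalake/warehouse"          # ONTAP NFS export (assumed)

# Baseline transfer of the whole source folder.
subprocess.run(["xcp", "copy", SRC, DST], check=True)

# Optional: re-read both sides and compare them after the copy completes.
subprocess.run(["xcp", "verify", SRC, DST], check=True)
```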

We did the testing for the customer in the CPOC lab. Each MapR data node has 20 CPU cores (40 vCPUs), 128GB of RAM, and dual-port 25GbE for MapR-FS communication. The XCP host has 28 CPU cores (56 vCPUs), 384GB of RAM, dual-port 100GbE for MapR communication, and dual-port 100GbE for ONTAP communication. We used a 9-node MapR cluster with 8 data nodes configured as NFS Gateways and 1 data node configured with XCP, and an A800 system for NetApp ONTAP NFS. XCP can use multiple source and destination interfaces to transfer the data. The following figure shows the setup used in the CPOC lab.

Setup used in the CPOC lab.

In the CPOC lab, we generated approximately 1TB of sample customer Hadoop data with the TeraGen Hadoop utility and transferred it with XCP. The following figure shows the results.
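For reference, TeraGen writes fixed 100-byte rows, so roughly 10 billion rows produce about 1TB. A minimal sketch of that generation step follows; the examples jar path and output directory are assumptions that depend on the Hadoop distribution.

```python
#!/usr/bin/env python3
"""Sketch: generate ~1TB of sample data with the TeraGen Hadoop utility.
The jar location and output directory are illustrative assumptions."""
import subprocess

ROWS = 10_000_000_000                 # TeraGen rows are 100 bytes each, so 10 billion rows is about 1TB
OUT_DIR = "/benchmarks/teragen-1tb"   # HDFS/MapR-FS output directory (assumed)
EXAMPLES_JAR = "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar"  # assumed path

# Run the standard 'hadoop jar <examples jar> teragen <rows> <output dir>' job.
subprocess.run(["hadoop", "jar", EXAMPLES_JAR, "teragen", str(ROWS), OUT_DIR], check=True)
```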

XCP performance.

These results show that XCP distributes the data transfers across the Hadoop members through the Hadoop NFS Gateway and the XCP instances. We achieved a peak throughput of 8.8GB/s with XCP, and the 1TB of data was transferred in 3 minutes and 21 seconds (roughly 5GB/s on average).

For More Information

Watch the videos, Data Mover Solution Parts 1, 2, and 3.

Benefits to the Customer

  • The customer was able to migrate data from different Hadoop file systems into unified NetApp ONTAP NFS storage.
  • No additional library development is needed to move data from the data lake, which reduces cost.
  • XCP maximizes performance by aggregating the throughput of multiple network interfaces and multiple servers against a single data source through multiple XCP instances.
  • XCP can be scheduled with an operating system scheduler such as cron, or run on demand, which provides flexibility for data migration (see the sketch after this list).
  • Using XCP and an NFS client for transfer means zero cost for data movement.
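As a minimal sketch of that scheduling option, the snippet below re-runs an incremental sync against an index created by a baseline copy (for example, `xcp copy -newid datalake1 <src> <dst>`); the index name, script path, and schedule are assumptions.

```python
#!/usr/bin/env python3
"""Sketch: incremental re-sync of a previously copied dataset.
Assumes the baseline was run as 'xcp copy -newid datalake1 <src> <dst>';
the index name and paths are illustrative."""
import subprocess

INDEX_NAME = "datalake1"   # index created by the baseline copy (assumed name)

# Apply only the source-side changes made since the last copy or sync.
subprocess.run(["xcp", "sync", "-id", INDEX_NAME], check=True)

# To schedule this nightly instead of running it on demand, a cron entry such as
#   0 2 * * * /usr/bin/python3 /opt/migration/xcp_sync.py
# can call the same script (the path is an assumption).
```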

Use Case 2: High-Performance Computing to ONTAP NFS

This use case is based on requests from field organizations. NetApp customers keep their data in high-performance computing (HPC) systems, which provide data analytics for training models and enable research organizations to gain insights from huge amounts of digital data. Our field engineers needed a detailed procedure to extract data from IBM GPFS to NFS, so we used XCP to migrate the data from GPFS to NFS for GPUs to process. AI workloads typically consume data from a network file system such as NFS.

Customer Requirements

  • The customer needed to run AI workloads on top of the General Parallel File System (GPFS) data.
  • The customer's AI GPU applications are not compatible with Hadoop or HPC GPFS.

Data Mover Solution

The IBM Spectrum Scale cluster runs with network shared disk (NSD) servers. GPFS is provisioned either from local disks or from NetApp SAN LUNs. NFS is configured on one or more NSD servers to serve the source NFS data. The NSD servers run Red Hat Enterprise Linux or SUSE Linux, which provide native NFSv3 support. An XCP instance is configured on each NSD server to transfer the NFS data from the NSD servers to NetApp ONTAP NFS, as sketched below.
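The following minimal sketch shows the idea on one NSD server: export the GPFS mount point over NFS, then copy it to ONTAP with XCP. The GPFS mount point, client subnet, and export paths are assumptions for illustration.

```python
#!/usr/bin/env python3
"""Sketch: serve a GPFS mount point over NFS from an NSD server and copy it
to ONTAP with XCP. Mount point, subnet, and export paths are assumptions."""
import subprocess

GPFS_MOUNT = "/gpfs/fs1"   # GPFS file system mounted on the NSD server (assumed)
EXPORT_LINE = f"{GPFS_MOUNT} 10.0.0.0/24(ro,no_root_squash)\n"

# 1. Export the GPFS mount point over NFS (standard Linux NFS server configuration).
with open("/etc/exports", "a") as exports:
    exports.write(EXPORT_LINE)
subprocess.run(["exportfs", "-ra"], check=True)

# 2. Copy from the NSD server's NFS export to the ONTAP NFS export.
SRC = "nsd-server1:/gpfs/fs1/training-data"   # assumed source path
DST = "ontap-lif:/ai/training-data"           # assumed destination path
subprocess.run(["xcp", "copy", SRC, DST], check=True)
```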

GPFS to NetApp ONTAP NFS configuration.

Benefits to the Customer

  • The customer extracted data from high-performance computing into unified NetApp ONTAP storage.
  • ONTAP NFS data can be accessed by GPU clusters such as DGX systems.
  • The customer can run AI operations on NFS data migrated from GPFS and can keep the data in sync between the HPC system and the AI GPU cluster.

XCP is suitable for the following scenarios:

  • High file count environments (hundreds of millions of files)
  • Environments that require short cutover windows
  • Third-party storage to NetApp storage controller
  • Customized reporting based on access time, modification time, size, owner, and file type (see the scan sketch below)
The technical details for moving GPFS and MapR-FS data to NFS are documented in TR-4732.
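As a minimal sketch of that reporting capability, the snippet below scans a source export and prints summary statistics before a migration; the export path is an assumption, and available flags can vary by XCP release, so check `xcp help scan` on your host.

```python
#!/usr/bin/env python3
"""Sketch: pre-migration report of a source tree with 'xcp scan'.
The source export is an illustrative assumption."""
import subprocess

SOURCE = "mapr-nfs-gw:/mapr/cluster1"   # assumed source export

# '-stats' summarizes the tree (counts, sizes, ages) for migration planning.
subprocess.run(["xcp", "scan", "-stats", SOURCE], check=True)
```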

Let me know what you think about this blog and TR in the comments below!

Karthikeyan Nagalingam

Karthikeyan Nagalingam is a Principal Technical Marketing Engineer at NetApp for NetApp XCP, FPolicy, File System Analytics, and Antivirus. His previous roles in Emerging Technology Solutions involved pre-sales and post-sales technical activities with field teams, partners, and customers. He holds a Master of Science in Software Systems from Birla Institute of Technology and Science.

