Kafka and the rise and fall of networked storage

Solutions & Products
- Solutions & Products
- Cloud Services
  Cloud Services
  
  World-class data management and storage solutions in the biggest public clouds.
  Visit Cloud Services
  
  Solutions
  
  Microsoft Azure
  
  Google Cloud
  
  AWS
  
  IBM Cloud
  
  Products
  
  Azure NetApp Files
  
  Amazon FSx for NetApp ONTAP
  
  Cloud Volumes Service for Google Cloud
  
  Cloud Volumes ONTAP
  
  Compute Optimization
  
  Cloud Sync
  
  Cloud Data Sense
  
  Cloud Tiering
  
  Cloud Backup Service
  
  Cloud Volumes Edge Cache
  
  Global File Cache
  
  Cloud Manager
  
  Astra
  
  Cloud Insights
  
  File Services / File Sharing
  
  MySQL
  
  PostgreSQL
  
  Kubernetes
  
  Quick Links
  
  Cloud Central
  
  Data Fabric
  
  Why NetApp for Cloud Services
  
  Spot by NetApp
  
  Customer Stories
  
  Test Drive
  
  Free Trials
  
  How to Buy
- Hybrid Cloud
  Hybrid Cloud
  
  Build your business on the best of cloud and on premises together with Hybrid Cloud Infrastructure solutions.
  Visit Hybrid Cloud
  
  Solutions
  
  Virtualization
  
  Service Provider Infrastructure
  
  IT Automation
  
  Private Clouds
  
  VMware
  
  Red Hat
  
  Quick Links
  
  Data Fabric
  
  Why NetApp for Hybrid Cloud
  
  What is Hybrid Cloud
  
  Customer Stories
  
  Test Drive
  
  Free Trials
  
  How to Buy
- Data Storage
  Data Storage
  
  NetApp is the proven leader when it comes to modernizing and simplifying your storage environment.
  Visit Data Storage
  
  Solutions
  
  SAN
  
  Scale-Out NAS
  
  Unstructured Data Solutions
  
  Products
  
  AFF A-Series
  
  AFF C190
  
  E-Series
  
  EF-Series
  
  FAS
  
  FlexPod
  
  SolidFire
  
  StorageGRID
  
  Disk Shelves & Storage Media
  
  Quick Links
  
  Data Fabric
  
  Why NetApp for Data Storage
  
  Customer Stories
  
  Test Drive
  
  Free Trials
  
  How to Buy
- Cyber Resilience
  Cyber Resilience
  
  Our industry-leading solutions are built so you can protect and secure your sensitive company data.
  Visit Cyber Resilience
  
  Solutions
  
  Data Protection
  
  Ransomware Protection
  
  Business Continuity / Disaster Recovery
  
  Data Backup and Recovery
  
  Data Compliance
  
  ONTAP Data Security
  
  Products
  
  SnapCenter
  
  Cloud Backup
  
  Quick Links
  
  Data Fabric
  
  Customer Stories
  
  Test Drive
  
  Free Trials
  
  How to Buy
- Data Management
  Data Management
  
  Get complete control over your data with simplicity, efficiency, and flexibility.
  Visit Data Management
  
  Solutions
  
  Simplicity365
  
  Products
  
  Active IQ
  
  Element Software
  
  OnCommand Insight
  
  ONTAP Data Management
  
  SANtricity Software
  
  Virtual Infrastructure Management
  
  Quick Links
  
  Data Fabric
  
  Data Management Specialists
  
  Customer Stories
  
  Test Drive
  
  Free Trials
  
  How to Buy
- Enterprise Applications
  Enterprise Applications
  
  Speed application development, improve software quality, reduce business risk, and shrink costs.
  Visit Enterprise Applications
  
  Solutions
  
  SAP
  
  Oracle
  
  MS SQL
  
  Quick Links
  
  Data Fabric
  
  Why NetApp for Enterprise Applications
  
  Customer Stories
  
  Test Drive
  
  Free Trials
  
  How to Buy
- DevOps
  Devops
  
  Our solutions remove friction to help maximize developer productivity, reduce time to market, and improve customer satisfaction.
  Visit Devops
  
  Solutions
  
  Configuration Management
  
  Containers
  
  Google Clouds Anthos
  
  Continuous Integration Continuous Delivery
  
  Quick Links
  
  Data Fabric
  
  Why NetApp for DevOps
  
  What is DevOps
  
  Customer Stories
  
  Test Drive
  
  Free Trials
  
  How to Buy
- AI
  AI
  
  NetApp AI solutions remove bottlenecks at the edge, core, and the cloud to enable more efficient data collection.
  Visit AI
  
  Solutions
  
  Big Data Analytics
  
  High Performance Computing
  
  Products
  
  ONTAP AI
  
  Quick Links
  
  Data Fabric
  
  Why NetApp for AI
  
  What is AI
  
  Customer Stories
  
  Test Drive
  
  Free Trials
  
  How to Buy
- VDI
  VDI
  
  Provide a powerful, consistent end-user computer (EUC) experience—regardless of team size, location, complexity.
  Visit VDI
  
  Products
  
  Spot PC
  
  Virtual Desktop Service
  
  Quick Links
  
  Data Fabric
  
  What is VDI
  
  Customer Stories
  
  Test Drive
  
  Free Trials
  
  How to Buy
- Services
  Services
  
  We have a service for your every need, plus the ones you’re about to discover.
  Visit Services
  
  Services
  
  Professional Services
  
  Support Services
  
  Quick Links
  
  Data Fabric
  
  Customer Stories
  
  Test Drive
  
  Free Trials
  
  How to Buy
Support & Training
How to Buy
Community

Why performance is important for tiered storage

I’ll get around to cost-effectiveness and ease of use, but it’s worth noting that Confluent itself considers performance isolation as the most critical requirement. If an application is reading historical data, it won’t add latency to other applications that are reading more recent data. In the words of Confluent, this factor “opens the door for real-time and historical analysis use cases in the same cluster.” This factor shows again why first-rate quality of storage capabilities is important for any shared storage infrastructure, but it also points to the importance of performance for the tiered storage layer.

Kafka doesn’t want to send its data to some archive repository that’s designed for cheap and deep cost optimization just to comply with an obscure government regulation that data be kept forever and a day. Kafka wants to send this data to something that can be used as an active archive. Some of that data, especially the most recent historical data, is likely to be quite “warm.” Even if the data cools over time, it’s important to provide an initial landing zone for data that no longer needs to be on the Kafka brokers. It not only speeds the most important historical and exploratory analytics, but it also provides a range of operational and ease-of-use benefits that I’ll talk about later.

Confluent validation with StorageGRID and comparisons with Amazon S3 and Pure FlashBlade

The design focus for StorageGRID, NetApp’s premier S3 offering, has always been for it to perform as an active archive, making it an optimal match for the needs of Kafka’s tiered storage. Recently, NetApp spent some time with Confluent to verify the performance and interoperability of Kafka with StorageGRID. The results are in our technical report TR-4912: Best practice guidelines for Confluent Kafka tiered storage with NetApp. If deep-diving into Kafka is your thing and you’re interested in all the details, I strongly recommend that you check out the report. But if you’re short on time, I highlight a few worthwhile things here.

StorageGRID leads the way in speed

StorageGRID has the fastest published performance of any object storage platform that’s currently validated by Confluent. Even our smallest three-node all-flash configuration easily outperforms our nearest rival, and if Pure’s publication of their test results are to be believed, StorageGRID is also 6 and a half times faster than Amazon S3. The following graph shows the combined results of NetApp’s and Pure’s testing:

Cost-effectiveness: Get the performance/capacity balance right without lock-in

The graph above shows that even with a small number of nodes, StorageGRID can deliver more performance than most Kafka users are looking for. With an ability to scale to well over 100 nodes, going to more than 100GBps at a single site is a straightforward exercise. But most customers would rather spend that money on extra capacity instead.

Which brings up the next point: The vast majority of data in the grid probably won’t warrant that level of performance. If the Amazon S3 number in the graph is accurate, I think it’s fair to say that 1GBps is ample for many kinds of historical reporting.

Mix and match with StorageGRID—and retain centralized management

That’s where another StorageGRID feature begins to shine. It can run on almost any hardware, not just the all-flash StorageGRID SGF6024 appliances, but also NetApp’s hybrid and high-density hard-drive-based appliances. And they can be mixed and matched inside the same grid.

Let’s say that you don’t want to buy NetApp appliances because you got a great deal on dense rack servers from a company like Lenovo, HPE, or even Dell EMC. Or maybe you have some spare capacity on your VMware infrastructure. You can install the StorageGRID software on pretty much any modern machine, virtual or physical that can run Docker, and you can add it to your grid. You might even find that putting StorageGRID software on old Kafka nodes is a great way to get more value out of a kit that you already own.

This ability to use your own servers isn’t just about getting less expensive storage. If you decide that you need even more performance than the SGF6024 provides, you can install StorageGRID software on a system that’s packed with as much CPU and memory as you like. To top it all off, unlike solutions that focus only on proprietary hardware in the data center, StorageGRID helps you get more value from your cloud credits. StorageGRID can automatically move or mirror your data into any of the major public clouds, all while keeping management in one place.

Automatically store data in the right location

That ability to move or to copy data to the right place is part of the larger integrated lifecycle management capabilities of StorageGRID. You can automatically get your data to the optimal location based on metadata rules. Then you keep your most important data both safe and warm, and you can keep your colder Kafka data for years without breaking the bank. You can easily implement Confluent’s vision for tiered storage.

“Imagine if the data in Kafka could be retained for months, years, or infinitely. The above problem can be solved in a much simpler way. All applications just need to get data from one system—Kafka—for both recent and historical data.” —Jun Rao, Confluent, Project Metamorphosis Month 3: Infinite Storage in Confluent Cloud for Apache Kafka.

Pure can do the performance thing pretty well (though not as well as NetApp can), but that’s about the extent of it. Pure doesn’t have any way of pushing cool data into lower-cost locations, and it can’t fully use the storage in public cloud. So, FlashBlade ends up as a relatively limited tactical product rather than something that you can use to build a comprehensive strategy for long-term Kafka data management.

Ease of use and operational simplicity: Eliminate backup and automate provisioning and rebalancing

Protect your data with a distributed architecture

Scaling isn’t just about performance or capacity. Often, the scarcest resources are time and expertise. NetApp StorageGRID is a distributed architecture, not just within a single tightly controlled hardware appliance but across truly geographical scale. Here’s an example:

Share this page

Ricky Martin

The story begins with grid computing

SAN falls out of favor with the cloud-native crowd

External storage makes a comeback

Why performance is important for tiered storage

Confluent validation with StorageGRID and comparisons with Amazon S3 and Pure FlashBlade

StorageGRID leads the way in speed

Test configurations were comparable

StorageGRID leads in price/performance

Cost-effectiveness: Get the performance/capacity balance right without lock-in

Mix and match with StorageGRID—and retain centralized management

Automatically store data in the right location

Ease of use and operational simplicity: Eliminate backup and automate provisioning and rebalancing

Protect your data with a distributed architecture

Set up policies, then let StorageGRID do the work

Reap huge benefits for Kafka

Networked storage still has a place in modern data analytics

Modernize your own data analytics infrastructure

Ricky Martin

Next Steps

Blogs

Community