Kinetic Campaign: Speeding Up Scientific Data Analytics with Computational Storage Drives and Multi-Level Erasure Coding

Abstract

Large-scale data analytics, machine learning, and big data applications often require the storage of a massive amount of data. For cost-effective high bandwidth, many data centers use tiered storage, with warmer tiers made of flash or persistent memory modules and cooler tiers provisioned with high-density rotational drives. While research communities and industry have increasingly demonstrated ultra-fast data insertion and retrieval rates on warm storage, complex queries with predicates on multiple columns still tend to experience excessive delays when unordered, unindexed (or only lightly indexed) data written in log-structured formats for high write bandwidth is subsequently read for ad-hoc analysis at row level. Queries run slowly because an entire dataset may have to be scanned in the absence of a full set of indexes on all columns. In the worst case, significant delays are experienced even when data is read from warm storage. A user sees even higher delays when data must be streamed from cool storage before analysis takes place.

In this talk, we present C2, a research collaboration between Seagate and Los Alamos National Lab (LANL) for the lab's next-generation campaign storage. Campaign is a scalable cool storage tier at LANL, managed by MarFS, that currently provides 60 PB of storage space for longer-term data storage. Cost-effective data protection is done through multi-level erasure coding at both node level and rack level. To prevent users from always having to read back all data for complex queries, C2 enables direct data analytics at the storage layer by leveraging Seagate Kinetic Drives to asynchronously add indexes to data at the per-drive level after data lands on the drives. The asynchronously constructed indexes cover all data columns and are read by the drives at query time to drastically reduce the amount of data that needs to be sent back to the querying client for result aggregation.

Combining computational storage technologies with erasure-coding-based data protection schemes for rapid data analytics over cool storage presents unique challenges: individual drives may not be able to see complete data records and may not deliver the performance required by high-level data insertion, access, and protection workflows. We discuss those challenges in the talk, share our designs, and report early results.
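
The per-drive indexing idea described in the abstract can be illustrated with a small sketch. This is not the C2 implementation or the Kinetic drive API: the `Drive` class, its methods, and the `run_query` helper are hypothetical names invented for illustration, and the sketch assumes numeric column values and that each drive holds complete records (which, as the abstract notes, erasure coding does not guarantee). It shows rows being appended without index maintenance, a later background pass that indexes every column, and a multi-column range query that each drive answers locally so only matching rows reach the client.

```python
from bisect import bisect_left, bisect_right


class Drive:
    """Hypothetical stand-in for one computational storage drive."""

    def __init__(self):
        self.rows = []          # append-only log of records (dicts of numbers)
        self.indexes = {}       # column name -> sorted list of (value, row_id)
        self.indexed_upto = 0   # rows[0:indexed_upto] are covered by indexes

    def insert(self, row):
        # Fast write path: append only, no index maintenance.
        self.rows.append(row)

    def build_indexes(self):
        # Background step: index every column of rows not yet covered.
        for row_id in range(self.indexed_upto, len(self.rows)):
            for col, val in self.rows[row_id].items():
                self.indexes.setdefault(col, []).append((val, row_id))
        for entries in self.indexes.values():
            entries.sort()
        self.indexed_upto = len(self.rows)

    def query(self, ranges):
        # ranges maps column -> (lo, hi), both bounds inclusive.
        candidates = None
        for col, (lo, hi) in ranges.items():
            entries = self.indexes.get(col, [])
            start = bisect_left(entries, (lo, -1))
            end = bisect_right(entries, (hi, float("inf")))
            ids = {row_id for _, row_id in entries[start:end]}
            candidates = ids if candidates is None else candidates & ids
        if candidates is None:  # no predicates: every indexed row matches
            candidates = set(range(self.indexed_upto))
        hits = [self.rows[i] for i in sorted(candidates)]
        # Rows written since the last index build are scanned directly.
        for row in self.rows[self.indexed_upto:]:
            if all(c in row and lo <= row[c] <= hi
                   for c, (lo, hi) in ranges.items()):
                hits.append(row)
        return hits


def run_query(drives, ranges):
    # Client side: merge the (already filtered) per-drive results.
    merged = [row for d in drives for row in d.query(ranges)]
    return sorted(merged, key=lambda r: r["id"])


# Made-up sample data spread over two drives.
drives = [Drive(), Drive()]
drives[0].insert({"id": 0, "energy": 1.5, "step": 10})
drives[0].insert({"id": 1, "energy": 9.0, "step": 20})
drives[1].insert({"id": 2, "energy": 8.2, "step": 20})
drives[1].insert({"id": 3, "energy": 8.5, "step": 30})
for d in drives:
    d.build_indexes()  # would run asynchronously after data lands

hits = run_query(drives, {"energy": (8.0, 10.0), "step": (20, 20)})
print([r["id"] for r in hits])
```

In this toy model the client receives two of the four rows instead of the whole dataset. The fallback scan over the unindexed tail is one possible way to keep results correct while index construction lags behind writes; the abstract does not say how C2 handles that window.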

Qing Zheng

Los Alamos National Lab

Related Sessions

Computational Storage

The Latest Efforts in the SNIA Computational Storage Technical Work Group (CS TWG)

With the ongoing work in the CS TWG, the chairs will present the latest updates from the membership of the working group.

Jason Molgaard

Solidigm

Scott Shadley
Solidigm Technology, SNIA

Computational Storage

NVMe Computational Storage – An update on the Standard

Learn what is happening in NVMe to support Computational Storage devices.

Kim Malone

Intel

Computational Storage

Computational Storage APIs

Computational Storage is a new field that is addressing performance and scaling issues for compute with traditional server architectures.

Oscar Pinto

Samsung Semiconductor Inc

Computational Storage

Computational Storage: How Do NVMe CS and SNIA CS Work Together?

NVMe and SNIA are both working on standards related to Computational Storage. The question that is continually asked is whether these efforts are compatible or at odds with each other.

William Martin

Samsung

Computational Storage

HDD Computational Storage Benchmarking

This presentation looks at a computational storage use case within the Human Cell Atlas genomics research, finds that the deployed hardware CS engine is insufficient, and examines why this is the case.

Philip Kufeldt

Seagate Technology

Computational Storage

RETINA: Exploring Computational Storage (SmartSSD) Usecase

Computational Storage offers near-data acceleration, and it is gaining popularity with recent commercialization and standardization efforts.

Vishwanath Maram

Samsung Semiconductor Inc

Changwoo Min
Virginia Tech

Computational Storage

Making Real File Systems Faster with Applied Computational Storage

The exploration of computation near flash storage has been prompted by the advent of network-attached flash-based storage enclosures operating at tens of gigabytes/sec, server memory bandwidths str

Dominic Manno

Los Alamos National Laboratory

Sean Gibb
Eideticom
Andrew Maier
Eideticom

Computational Storage

Green Computing with Computational Storage Devices

Data center power consumption is currently one of the biggest concerns, and green computing is a main industry interest.

Changho Choi

Samsung

Yangwook Kang
Samsung Semiconductor, Inc.

Computational Storage

The Apache Ozone: A Distributed Object Storage System is with Erasure Coding

Apache Ozone is a highly scalable distributed object storage system that also provides a file system interface.

Uma Maheswara Rao Gangumalla

Cloudera Inc

Computational Storage

Accelerating Near Real-time Analytics with High Performance Object Storage

Computational storage in general can bring unique benefits in increasing the efficiency of CPU utilization in a data processing system.

Mayank Saxena

Samsung

Computational Storage

File System Acceleration using Computational Storage for Efficient Data Storage

We examine the benefits of using computational storage devices like Xilinx SmartSSD to offload the compression to achieve an ideal compression scheme where higher compression ratios are achieved wi

Vaishnavi S G

AMD
