An Analysis of Collocation on GPUs for Deep Learning Training

Ties Robroek, Ehsan Yousefzadeh-Asl-Miandoab, Pınar Tözün

Accepted to EuroMLSys 2024

GitHub

Abstract

Deep learning training is an expensive process that extensively uses GPUs. However, not all model training saturates modern powerful GPUs. To create guidelines for such cases, this paper examines the performance of the different collocation methods available on NVIDIA GPUs: naïvely submitting multiple processes on the same GPU using multiple streams, utilizing Multi-Process Service (MPS), and enabling the Multi-Instance GPU (MIG). Our results demonstrate that collocating multiple model training runs yields significant benefits, increasing training throughput by up to three times despite longer epoch times. On the other hand, the aggregate memory footprint and compute needs of the models trained in parallel must fit within the available memory and compute resources of the GPU. MIG can be beneficial thanks to its interference-free partitioning but can suffer from sub-optimal GPU utilization with dynamic or mixed workloads. In general, we recommend MPS as the best-performing and most flexible form of collocation for a single user submitting training jobs.
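As a rough illustration of the MPS option only (not the paper's actual experiment harness), the Python sketch below starts the MPS control daemon and launches two training processes on the same GPU; the train.py script and the model names are hypothetical placeholders.

    # Hedged sketch: collocating two training jobs on one GPU via NVIDIA MPS.
    # "train.py" and the model names are hypothetical placeholders.
    import os
    import subprocess

    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = "0"  # both jobs target the same physical GPU

    # Start the MPS control daemon (harmless if it is already running).
    subprocess.run(["nvidia-cuda-mps-control", "-d"], env=env, check=False)

    # With MPS, kernels from both processes can run concurrently on the GPU
    # rather than being time-sliced between separate CUDA contexts.
    jobs = [
        subprocess.Popen(["python", "train.py", "--model", "resnet50"], env=env),
        subprocess.Popen(["python", "train.py", "--model", "mobilenet"], env=env),
    ]
    for job in jobs:
        job.wait()

    # Shut the MPS daemon down once the experiment finishes.
    subprocess.run(["nvidia-cuda-mps-control"], input="quit\n", text=True, check=False)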

Data Management and Visualization for Benchmarking Deep Learning Training Systems

Ties Robroek, Aaron Duane, Ehsan Yousefzadeh-Asl-Miandoab, Pınar Tözün

Accepted to DEEM 2023 (Best Presentation Award)

GitHub

Abstract

Evaluating hardware for deep learning is challenging. The models can take days or more to run, the datasets are generally larger than what fits into memory, and the models are sensitive to interference. Scaling this up to a large number of experiments while keeping track of both software and hardware metrics thus poses real difficulties, as these problems are exacerbated by the sheer volume of experimental data. This paper explores some of the data management and exploration difficulties when working on machine learning systems research. We introduce our solution in the form of an open-source framework built on top of a machine learning lifecycle platform. Additionally, we introduce a web environment for visualizing and exploring experimental data.
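Purely as an illustration of the idea of keeping hardware and software metrics together under a lifecycle platform, the sketch below assumes an MLflow-style tracking API; the run name, parameters, and metric values are placeholders, not data from the paper.

    # Illustrative sketch: recording hardware and software metrics of a
    # benchmark run with an MLflow-style tracking API. All names and values
    # below are placeholders, not data from the paper.
    import mlflow

    with mlflow.start_run(run_name="resnet50-a100-baseline"):
        mlflow.log_param("gpu", "A100")          # hardware under test (placeholder)
        mlflow.log_param("batch_size", 128)      # software configuration (placeholder)
        for epoch in range(3):                   # stand-in for a real training loop
            mlflow.log_metric("epoch_time_s", 120.0 + epoch, step=epoch)
            mlflow.log_metric("gpu_mem_used_gb", 14.2, step=epoch)

Storing both kinds of metrics under one run keeps them queryable together, which is the property a visualization front-end needs.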

Profiling and Monitoring Deep Learning Training Tasks

Ehsan Yousefzadeh-Asl-Miandoab, Ties Robroek, Pınar Tözün

Accepted to EuroMLSys 2023

Abstract

The embarrassingly parallel nature of deep learning training tasks makes CPU-GPU co-processors the primary commodity hardware for them. The computing and memory requirements of these tasks, however, do not always align well with the available GPU resources. It is, therefore, important to monitor and profile the behavior of training tasks on co-processors to better understand the requirements of different use cases. In this paper, our goal is to shed more light on the variety of tools for profiling and monitoring deep learning training tasks on server-grade NVIDIA GPUs. In addition to surveying the main characteristics of the tools, we analyze the functional limitations and overheads of each tool using both a light and a heavy training scenario. Our results show that monitoring tools like nvidia-smi and dcgm can be integrated with resource managers for online decision making thanks to their low overheads. On the other hand, one has to be careful about the set of metrics to reason correctly about GPU utilization. When it comes to profiling, each tool has its time to shine; a framework-based or system-wide GPU profiler can first detect the frequent kernels or bottlenecks, and then a lower-level GPU profiler can focus on particular kernels at the micro-architectural level.
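As a rough illustration of the kind of low-overhead monitoring discussed above, the sketch below polls GPU utilization and memory through NVML (the library behind nvidia-smi) using the pynvml bindings; the device index and one-second polling interval are arbitrary choices, not values from the paper.

    # Minimal sketch: low-overhead GPU monitoring via NVML (the library that
    # backs nvidia-smi). Device index and polling interval are illustrative.
    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU

    try:
        for _ in range(10):                         # sample for ~ten seconds
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(f"sm_util={util.gpu}% mem_util={util.memory}% "
                  f"mem_used={mem.used / 2**30:.1f} GiB")
            time.sleep(1.0)
    finally:
        pynvml.nvmlShutdown()

Note that the sm_util figure only reports whether any kernel was executing during the sample window, which is one reason the abstract cautions about choosing the right set of metrics.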

TPCx-AI on NVIDIA Jetsons

Robert Bayer, Jon Voigt Tøttrup, Pınar Tözün

Accepted to TPCTC 2022

Abstract

Despite their resource- and power-constrained nature, edge devices have also seen an increase in available compute and memory resources, as well as in heterogeneity, similar to the evolution of server hardware over the past decade. For example, NVIDIA Jetson devices have a system-on-chip (SoC) composed of an ARM CPU and an NVIDIA GPU sharing up to 32 GB of RAM. Such an SoC setup offers opportunities to push complex computations down closer to the data source rather than performing them on remote servers.
In this paper, we characterize the performance of two types of NVIDIA Jetson devices for end-to-end machine learning pipelines using the TPCx-AI benchmark. Our results demonstrate that the available memory is the main limitation to performance and to scaling up machine learning workloads on edge devices. Despite this limitation, some edge devices show promise when compared against desktop hardware in terms of power efficiency and reduced data movement. In addition, exploiting the available compute parallelism on these devices can benefit not just model training and inference but also data pre-processing. By parallelizing, we get close to an order of magnitude improvement in pre-processing time for one of the TPCx-AI use cases. Finally, while TPCx-AI is a valuable benchmark, it is designed for server settings; therefore, the community needs an end-to-end machine learning benchmark targeting IoT/edge.
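The pre-processing speed-up mentioned above comes from spreading the work across the device's CPU cores; the generic Python sketch below shows that pattern with multiprocessing, using dummy records and placeholder cleaning logic rather than the actual TPCx-AI pipeline.

    # Generic sketch of parallel data pre-processing on a multi-core edge device.
    # The records and the cleaning logic are placeholders, not TPCx-AI code.
    from multiprocessing import Pool

    def preprocess(record):
        """Split one raw record, trim whitespace, and drop missing fields."""
        fields = [f.strip() for f in record.split(",")]
        return [f for f in fields if f and f != "NA"]

    if __name__ == "__main__":
        raw = [f"{i}, value_{i}, NA, {i * 0.5}" for i in range(1_000_000)]  # dummy input
        with Pool(processes=6) as pool:            # e.g. six of the Jetson's ARM cores
            cleaned = pool.map(preprocess, raw, chunksize=10_000)
        print(f"pre-processed {len(cleaned)} records")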

Micro-architectural Analysis of a Learned Index

Mikkel Møller Andersen, Pınar Tözün

Published in aiDM 2022

Abstract

Since the publication of The Case for Learned Index Structures in 2018, there has been a rise in research that focuses on learned indexes for different domains and with different functionalities. While the effectiveness of learned indexes as an alternative to traditional index structures such as B+Trees has already been demonstrated by several studies, previous work tends to focus on higher-level performance metrics such as throughput and index size. In this paper, our goal is to dig deeper and investigate how learned indexes behave at a micro-architectural level.
More specifically, we focus on the previously proposed learned index structure ALEX, a tree-based in-memory index structure that consists of a hierarchy of machine-learned models. Unlike the original proposal for learned indexes, ALEX is designed from the ground up to allow updates and inserts. Therefore, it enables more dynamic workloads using learned indexes. In this work, we perform a micro-architectural analysis of ALEX and compare its behavior to tree-based index structures that are not based on learned models, i.e., ART and B+Tree.
Our results show that ALEX is bound by memory stalls, mainly stalls due to data misses from the last-level cache. Compared to ART and B+Tree, ALEX exhibits fewer stalls and a lower cycles-per-instruction value across different workloads. On the other hand, the instructions required to handle out-of-bound inserts in ALEX can increase the number of instructions needed per request significantly (by 10X) for write-heavy workloads. However, the micro-architectural behavior shows that this increase in the instruction footprint exhibits high instruction-level parallelism and, therefore, does not negatively impact the overall execution time.
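The cycles-per-instruction and stall figures above come from hardware performance counters; as a rough sketch of how such counters can be collected on Linux (the benchmark binary is a placeholder, and this is not necessarily the paper's exact methodology), one can wrap perf stat and derive CPI from the raw counts.

    # Rough sketch: collect cycle and instruction counts with Linux `perf stat`
    # and derive cycles-per-instruction (CPI). "./index_benchmark" is a
    # placeholder; the paper's exact event set and tooling may differ.
    import subprocess

    cmd = ["perf", "stat", "-x", ",", "-e", "cycles,instructions",
           "--", "./index_benchmark"]

    # With -x, perf writes its counter report to stderr in CSV form.
    result = subprocess.run(cmd, capture_output=True, text=True)

    counts = {}
    for line in result.stderr.splitlines():
        parts = line.split(",")
        if len(parts) > 3 and parts[0].strip().isdigit():
            counts[parts[2]] = int(parts[0])

    if "cycles" in counts and "instructions" in counts:
        print(f"CPI = {counts['cycles'] / counts['instructions']:.2f}")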

Training for Speech Recognition on Co-processors

Sebastian Baunsgaard, Sebastian Benjamin Wrede, Pınar Tözün

Published in ADMS 2020

Abstract

Automatic Speech Recognition (ASR) has increased in popularity in recent years. The evolution of processor and storage technologies has enabled more advanced ASR mechanisms, fueling the development of virtual assistants such as Amazon Alexa, Apple Siri, Microsoft Cortana, and Google Home. The interest in such assistants, in turn, has amplified the novel developments in ASR research. However, despite this popularity, there has not been a detailed training efficiency analysis of modern ASR systems. This mainly stems from the proprietary nature of many modern applications that depend on ASR, the relatively expensive co-processor hardware that big vendors use to accelerate ASR for such applications, and the absence of well-established benchmarks. The goal of this paper is to address the latter two of these challenges.
The paper first describes an ASR model, based on a deep neural network inspired by recent work, and our experiences building it. Then we evaluate this model on three CPU-GPU co-processor platforms that represent different budget categories. Our results demonstrate that utilizing hardware acceleration yields good results even without high-end equipment. While the most expensive platform (10X the price of the least expensive one) converges to the initial accuracy target 10-30% and 60-70% faster than the other two, the differences among the platforms almost disappear at slightly higher accuracy targets. In addition, our results further highlight both the difficulty of evaluating ASR systems, due to the complex, long, and resource-intensive nature of model training in this domain, and the importance of establishing benchmarks for ASR.