On Monday, February 13, 2023, we are holding a one-day hybrid workshop on Resource-Aware Data Science at ScrollBar at the IT University of Copenhagen and on Zoom.
The event is collocated with the midway evaluation exams of the PhD students Ties Robroek and Ehsan Yousefzadeh-Asl-Miandoab.
The event consists of talks from the exam committee members and is open to the public.
However, the exams of the students aren't public.
You can find the detailed program and talk information below.


Time (Danish timezone) Event
09:00 - 10:30 Ties' exam (not public)
10:30 - 11:00 Break
11:00 - 11:30 In-person Talk by Rob van der Goot, ITU, Denmark
11:30 - 12:00 In-person Talk by Zoi Kaoudi, ITU, Denmark
12:00 - 12:30 In-person Talk by Matthias Boehm, TU Berlin, Germany
12:30 - 13:30 Lunch Break
13:30 - 14:00 Online Talk by Florina Ciorba, University of Basel, Switzerland
14:00 - 14:30 In-person Talk by Luca Maria Aiello, ITU, Denmark
14:30 - 15:00 Online Talk by Oana Balmau, McGill University, Canada
15:00 - 15:30 Break
15:30 - 17:00 Ehsan's exam (not public)

Talk Info

Rob van der Goot

Assistant Professor at IT University of Copenhagen

MaChAmp: Multi-task Learning to the Rescue in Resource Scarce Scenarios - Slides


In Natural Language Processing (NLP), the Wall Street Journal section of the Penn Treebank has been the main evaluation benchmark for a long time. This dataset contains well-edited English news texts from the 1980s, and is thus not representative of most real-world language use. If we transfer to new domains and languages, current systems struggle more since our algorithms were not designed for these and training data is scarce. MaChAmp is a toolkit focusing on multi-task learning, which can be used to bridge the performance gap to more interesting language varieties. In this talk, I will walk through the abilities of the toolkit, how it's made to be efficient, and how we used it to cheaply improve performance on a wide variety of tasks and languages.


Rob van der Goot got his PhD from the University of Groningen, where he worked on normalization of social media data and automatic syntactic analysis. Since then, he has been at the ITU, where he has broadened the scope of his research, which now focuses on multi-task, multi-lingual, and cross-domain natural language processing.

Zoi Kaoudi

Associate Professor at IT University of Copenhagen

Learning-based Query Optimization: What Are We Still Missing? - Slides


Query optimization is crucial for any data management system to achieve good performance. Due to the complexity of the query optimization task, it is still an open problem. Recent advancements in AI have led academia and industry to investigate learning-based techniques in query optimization. In particular, many works propose replacing the cost model used during plan enumeration with a machine learning model (typically a regression model) that estimates the runtime of a plan. This approach has already been adopted by different data management systems, including Apache Wayang (incubating), a cross-platform data processing system we have built. Interestingly though, it is well-known that what really matters in query optimization is the relative order of the query plan alternatives and not their actual cost or runtime. In this talk, we will first discuss how we can leverage a learning-to-rank approach in query optimization instead of building regression models. We will show that, by leveraging the knowledge of the rank of previously executed query plans, we can achieve up to an order of magnitude better query performance than regression models. Subsequently, as learning-based query optimization highly depends on the training data available, we will present DataFarm, an efficient data-driven training data generator that can output high-quality training data (query plans with their runtimes) in a fraction of the time required for collecting all labels manually.


Zoi Kaoudi is an Associate Professor at the IT University of Copenhagen. Her research interests lie at the intersection of machine learning systems, data management, and knowledge graphs. She has previously worked as a Senior Researcher at the Technical University of Berlin, as a Scientist at the Qatar Computing Research Institute (QCRI), as a visiting researcher at IMIS-Athena Research Center, and as a postdoctoral researcher at Inria Saclay. She received her Ph.D. from the National and Kapodistrian University of Athens in 2011. She is currently a Proceedings and Metadata Chair of ISWC 2023. Previously, she has been an Associate Editor of SIGMOD 2022, proceedings chair of EDBT 2019, co-chair of the TKDE poster track co-located with ICDE 2018, and co-organizer of MLDAS 2019, held in Qatar. She has co-authored articles in both the database and ML communities and served as a member of the Program Committee for several international database conferences. She recently received the best demonstration award at ICDE 2022 for her work on training data generation for learning-based query optimization.

Matthias Boehm

Professor at TU Berlin

Towards Holistic Redundancy Exploitation for Data-centric ML Pipelines - Slides


Data-centric machine learning (ML) pipelines include – besides the training and hyper-parameter tuning of ML models – primitives for data cleaning, data augmentation, data validation, and model debugging to construct high-quality datasets with good coverage. Training such pipelines with state-of-the-art models is, however, a very expensive, resource- and energy-intensive process. To reduce this overhead, there are redundancy-exploiting techniques such as enforcing and leveraging sparsity, lossy and lossless compression, data sampling, as well as specialized data types, kernels and distribution strategies. Applying such techniques is, however, still a labor-intensive trial and error process, where these techniques are utilized in isolation. In this talk, we first discuss some of our prior work on sparsity exploitation in fused operator pipelines, sparsity estimation, lineage-based full and partial reuse, as well as workload-aware lossless compression. Subsequently, we share our new vision for learning the joint application and parameterization (multiplexing for short) of multiple redundancy-exploiting techniques (e.g., sparsity, compression, data composition) as part of training data-centric ML pipelines.


Matthias Boehm is a full professor for large-scale data engineering at Technische Universität Berlin and the BIFOLD research center. His cross-organizational research group focuses on high-level, data science-centric abstractions as well as systems and tools to execute these tasks in an efficient and scalable manner. From 2018 through 2022, Matthias was a BMK-endowed professor for data management at Graz University of Technology, Austria, and a research area manager for data management at the co-located Know-Center GmbH. Prior to joining TU Graz in 2018, he was a research staff member at IBM Research - Almaden, CA, USA, with a major focus on compilation and runtime techniques for declarative, large-scale machine learning in Apache SystemML. Matthias received his Ph.D. from Dresden University of Technology, Germany in 2011 with a dissertation on cost-based optimization of integration flows. His previous research also includes systems support for time series forecasting as well as in-memory indexing and query processing.

Florina Ciorba

Associate Professor at University of Basel

Multilevel Scheduling in Action for Data Analysis Pipelines with DAPHNE - Slides


Modern HPC systems exhibit multiple levels of hardware parallelism that require different scheduling techniques for optimal performance. On the one hand, the lack of coordination between these schedulers can result in decreased performance and inefficient resource usage. Exchange of information between schedulers at multiple levels is critical for improving scheduling decisions at each level. Establishing information exchange and enabling coordination between schedulers at different levels is a current open research problem, which we call multilevel scheduling (MLS). On the other hand, DAPHNE has recently been proposed as an open-source software infrastructure for integrated data analysis (IDA) pipelines, emerging to meet the increasing data processing and computing demands of scientific workflows. We see DAPHNE as a unique opportunity to incubate MLS in a scheduler, DAPHNEsched. This talk will go over a number of recent results by our group on the topics of MLS in HPC and in DAPHNE. The guiding principle of multilevel scheduling in both contexts is that both local and global (distributed) information needs to be exchanged and coordinated to efficiently exploit multilevel and heterogeneous parallel resources.


Florina Ciorba is an Associate Professor of High Performance Computing at the University of Basel, Switzerland. From 2010 to 2015, she was a (tenured 2014-2015) senior scientist at the Center for Information Services and High Performance Computing at Technische Universität Dresden, Germany. From 2008 to 2010 she was a postdoctoral research associate at CAVS at Mississippi State University, USA. She received her doctoral degree in Computer Engineering in 2008 from the National Technical University of Athens, Greece. She is a member of OpenMP ARB, MPI Forum, Energy Efficiency HPC WG, SPEC HPG (with contribution to SPEChpc 2021), HiPEAC, and WHPC. Florina and co-authors won best paper awards at Cluster 2019, IPDPSW ParLearning 2014, ICPDC 2014, and ISPDC 2011, and ranked at the top at ISPDC 2019. She is an ACM Senior and Life Member, an IEEE and IEEE Computer Society member, as well as a SIAM and SIAG SC member. She is a founding member of the IDEAS4HPC Association, a Swiss chapter of WHPC. Her research interests include scalable & robust performance optimization in HPC, system & application operational data analytics, and security in HPC. More information at hpc.dmi.unibas.ch.

Luca Maria Aiello

Associate Professor at IT University of Copenhagen

Towards Health Surveillance with Social Media - Slides


In today’s heavily interconnected world, health crises develop rapidly and spread afar. Health crises burden not only people’s biological health but also their psychological and social spheres. By monitoring the population’s behavior and health conditions, public health bodies can plan prevention campaigns, promptly deploy targeted interventions, and adapt to people’s responses. We contributed to this stream of research by developing a Deep Learning tool for Natural Language Processing that extracts mentions of virtually any medical condition from unstructured social media text. We applied the tool to Reddit and Twitter posts, analyzed the clusters of the two resulting co-occurrence networks of conditions, and discovered that they correspond to well-defined categories of medical conditions. This resulted in the creation of a taxonomy of medical conditions automatically derived from online discussions. We validated the structure of our taxonomy against the official International Statistical Classification of Diseases and Related Health Problems (ICD-11), finding that our clusters match 20 of the 22 official categories. Based on the mentions of our taxonomy’s sub-categories in Reddit posts geo-referenced in the U.S., we were then able to compute disease-specific health scores that correlate with the officially reported prevalence of 18 conditions.


Luca Aiello holds a PhD in Computer Science from the University of Turin, Italy. He is currently an Associate Professor at the IT University of Copenhagen. Previously, he worked for 10 years as a Research Scientist in industry: at Yahoo Labs in Barcelona, and at Bell Labs in Cambridge (UK). He conducts research in Computational Social Science, an interdisciplinary field of study that uses Social Science theories to guide the solution to Data Science problems. He is currently working on text analysis techniques that, when applied to conversations, can help understand people's social behavior and psychological well-being. His work has been covered by hundreds of news articles published by news outlets worldwide including Wired, WSJ, and BBC.

Oana Balmau

Assistant Professor at McGill University

Characterizing I/O Patterns in Machine Learning - Slides


Data is the driving force behind machine learning (ML) algorithms. The way we ingest, store, and serve data can impact end-to-end training and inference performance significantly. For instance, as much as 50% of the power can go into storage and data cleaning in large production settings. The amount of data that we produce is growing exponentially, making it expensive and difficult to keep entire training datasets in main memory. Increasingly, ML algorithms will need to access data directly from persistent storage in an efficient manner. To address this challenge, this work sets out to characterize I/O patterns in ML, with a focus on data pre-processing and training. We use trace collection to understand storage impact in ML. Key factors we are investigating include the workload type, software framework used (e.g., PyTorch, TensorFlow), accelerator type (e.g., GPU, TPU), dataset size to memory ratio, and degree of parallelism. The trace collection is done mainly through eBPF and other system monitoring tools such as mpstat and NVIDIA Nsight. Our traces include VFS-layer calls such as read, write, open, create, etc., as well as mmap calls, block I/O accesses, CPU use, memory use, and accelerator use. Based on the trace analysis, we plan to build a synthetic I/O workload generator. The workload generator will accurately reproduce I/O patterns for representative ML workloads, simulating the computation time.


Oana Balmau is an Assistant Professor in the School of Computer Science at McGill University, where she leads the DISCS Lab. Her research focuses on storage systems and data management systems, with an emphasis on machine learning, data science, and edge computing workloads. Oana completed her PhD at the University of Sydney, advised by Prof. Willy Zwaenepoel. Before her PhD, Oana earned her Bachelor's and Master's degrees in Computer Science from EPFL, Switzerland. Oana won the CORE John Makepeace Bennett Award 2021 for the best computer science dissertation in Australia and New Zealand, and an Honorable Mention for the ACM SIGOPS Dennis M. Ritchie Doctoral Dissertation Award 2021. She is also a part of MLCommons, an open engineering consortium that aims to accelerate machine learning innovation, where she is leading the effort for storage benchmarking.