Over the last decade, machine learning models have achieved remarkable success by learning from large amounts of data. This is best exemplified by the recent rise of foundation models trained on billions of examples. Training on massive data, however, depends on exceptionally large and expensive computational resources, and incurs substantial financial and environmental costs due to its significant energy consumption. To reduce these costs, there has been a recent surge of interest in data-efficient learning techniques that train machine learning models on smaller subsets of carefully chosen training examples. The field, however, is filled with many heuristics that at times seem contradictory, and has become increasingly diverse and difficult for a non-expert audience to grasp. The goal of this tutorial is to provide a unifying perspective by discussing recent theoretically rigorous approaches to data-efficient machine learning. We will discuss rigorous techniques for data-efficient supervised learning and self-supervised contrastive pre-training. Then, we will focus on foundation models and discuss data selection for (pre-)training large vision-language models, such as CLIP. We will conclude by discussing open challenges and providing guidelines for data-efficient training of large language models (LLMs).