About Me

Hi, I’m Sid, a third-year PhD candidate studying Computer Science at UCLA. I’m advised by Professor Baharan Mirzasoleiman. My research focuses on data curation for learning with limited supervision, i.e., selecting or generating the best small subsets of data for training, to reduce costs without sacrificing accuracy. I aim to develop practically effective and theoretically rigorous approaches to solving these problems.

Open Office Hours: To pay forward all the help I’ve received so far in pursuing a career in ML research, I am dedicating 1-2 hours each week to open office hours. These are best suited for relatively junior students (undergraduate/master’s) since I’m not very experienced myself :). If you’d like to chat about research, grad school or anything else, please fill out this form.

In my free time, I like to write (https://medium.com/@sjoshi804), read about philosophy, and run.

Highlights

Foundations of Data-efficient Machine Learning Tutorial @ ICML ’24: (Slides, Video) Gave a 2-hour tutorial at ICML ’24 on principled approaches to data curation / pruning for efficient learning!
MM-GEN: MM-GEN is the first VLM data curation method that enables fully automated data curation to improve VLMs on downstream tasks, requiring as few as 50 reference examples from the task. Code available on GitHub!
CLIPCov: CLIPCov selects subsets of pre-training data to enable data-efficient contrastive language-image pre-training (CLIP) (AISTATS ’24). Speed up your CLIP model training by over 50% with theoretically grounded data efficiency!
SAS: SAS selects subsets of pre-training data to enable data-efficient contrastive SSL (ICML ’23). Give it a spin to try out data-efficient SSL and speed up SSL training by over 40%!
SpuCo: SpuCo is a Python package developed to make research on addressing spurious correlations effortless. Check it out!

News

June 2025: I’m excited to be joining DatalogyAI as a Research Scientist Intern!
January 2025: MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation preprint on arXiv!
January 2025: Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks accepted to ICLR ’25!
October 2024: Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks preprint on arXiv!
July 2024: Will be giving a tutorial on Foundations of Data-Efficient Learning at ICML ’24!
June 2024: Will be interning this summer at Microsoft Research (AI Frontiers Team) under Dr. Neel Joshi!
February 2024: I have successfully advanced to candidacy!
January 2024: Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity accepted to AISTATS 2024!
January 2024: Understanding the Robustness of Multi-modal Contrastive Learning to Distribution Shift and Investigating the Benefits of Projection Head for Representation Learning accepted to ICLR 2024!
June 2023: Towards Mitigating Spurious Correlations in the Wild: A Benchmark & New Datasets preprint on arXiv!
May 2023: Data-Efficient Contrastive Self-supervised Learning: Most Beneficial Examples for Supervised Learning Contribute the Least accepted to ICML 2023!
May 2023: Which Features are Learnt by Contrastive Learning? On the Role of Simplicity Bias in Class Collapse and Feature Suppression accepted to ICML 2023 for an oral (top 2%)!
July 2022: Low Rank Pruning via Output Perturbation at the Sparsity in Neural Networks Workshop!

Publications

[1] Siddharth Joshi, Jiayi Ni and Baharan Mirzasoleiman, Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks, ICLR 2025.

[2] Siddharth Joshi, Arnav Jain, Ali Payani and Baharan Mirzasoleiman, Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity, AISTATS 2024.

[3] Yihao Xue, Siddharth Joshi, Dang Nguyen and Baharan Mirzasoleiman, Understanding the Robustness of Multi-modal Contrastive Learning to Distribution Shift, ICLR 2024.

[4] Yihao Xue, Eric Gan, Jiayi Ni, Siddharth Joshi and Baharan Mirzasoleiman, Investigating the Benefits of Projection Head for Representation Learning, ICLR 2024.

[5] Siddharth Joshi and Baharan Mirzasoleiman, Data-Efficient Contrastive Self-supervised Learning: Most Beneficial Examples for Supervised Learning Contribute the Least, ICML 2023.

[6] Yihao Xue, Siddharth Joshi, Eric Gan, Pin-Yu Chen and Baharan Mirzasoleiman, Which Features are Learnt by Contrastive Learning? On the Role of Simplicity Bias in Class Collapse and Feature Suppression, ICML 2023 (Oral).

[7] Siddharth Joshi*, Yuhan Liu* and Baharan Mirzasoleiman, Low Rank Pruning via Output Perturbation, Sparsity in Neural Networks Workshop 2022.

* = equal contribution

Preprints

[1] Siddharth Joshi, Besmira Nushi, Vidhisha Balachandran, Varun Chandrasekaran, Vibhav Vineet, Neel Joshi and Baharan Mirzasoleiman, MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation, arXiv.

[2] Siddharth Joshi, Yu Yang, Yihao Xue, Wenhan Yang and Baharan Mirzasoleiman, Towards Mitigating Spurious Correlations in the Wild: A Benchmark & New Datasets, arXiv.