About Me

Hi, I’m Sid, a third-year PhD candidate in Computer Science at UCLA, advised by Professor Baharan Mirzasoleiman. My research focuses on data curation for learning with limited supervision, i.e., selecting or generating small subsets of data for training that reduce costs without sacrificing accuracy. I aim to develop approaches to these problems that are both practically effective and theoretically rigorous.

Open Office Hours: In an effort to pay forward all the help I’ve received so far in pursuing a career in ML research, I am dedicating 1-2 hours each week to open office hours. These are best suited for relatively junior students (undergraduate/masters), since I’m not very experienced myself :). If you’d like to chat about research, grad school, or anything else, please fill out this form.

In my free time, I like to write (https://medium.com/@sjoshi804), read about philosophy, and run.

Highlights

  • Foundations of Data-efficient Machine Learning Tutorial @ ICML ‘24: (Slides, Video) Gave a two-hour tutorial on principled approaches to data curation and pruning for efficient learning!
  • MM-GEN: MM-GEN is the first VLM data curation method that enables fully automated data curation to improve VLMs on downstream tasks, requiring as few as 50 reference examples from the task. Code available on GitHub!
  • CLIPCov: CLIPCov selects subsets of pre-training data to enable data-efficient contrastive language-image pre-training (CLIP) (AISTATS ‘24). Speed up your CLIP training by over 50% with theoretically grounded data efficiency!
  • SAS: SAS selects subsets of pre-training data to enable data-efficient contrastive SSL (ICML ‘23). Give it a spin to try out data-efficient SSL and speed up SSL training by over 40%!
  • SpuCo: SpuCo is a Python package developed to make research on addressing spurious correlations effortless. Check it out!

Publications

[1] Siddharth Joshi, Jiayi Ni and Baharan Mirzasoleiman, Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks, ICLR 2025.

[2] Siddharth Joshi, Arnav Jain, Ali Payani and Baharan Mirzasoleiman, Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity, AISTATS 2024.

[3] Yihao Xue, Siddharth Joshi, Dang Nguyen and Baharan Mirzasoleiman, Understanding the Robustness of Multi-modal Contrastive Learning to Distribution Shift, ICLR 2024.

[4] Yihao Xue, Eric Gan, Jiayi Ni, Siddharth Joshi and Baharan Mirzasoleiman, Investigating the Benefits of Projection Head for Representation Learning, ICLR 2024.

[5] Siddharth Joshi and Baharan Mirzasoleiman, Data-Efficient Contrastive Self-supervised Learning: Most Beneficial Examples for Supervised Learning Contribute the Least, ICML 2023.

[6] Yihao Xue, Siddharth Joshi, Eric Gan, Pin-Yu Chen and Baharan Mirzasoleiman, Which Features are Learnt by Contrastive Learning? On the Role of Simplicity Bias in Class Collapse and Feature Suppression, ICML 2023 (Oral).

[7] Siddharth Joshi*, Yuhan Liu* and Baharan Mirzasoleiman, Low Rank Pruning via Output Perturbation, Sparsity in Neural Networks Workshop 2022.

* = equal contribution

Preprints

[1] Siddharth Joshi, Besmira Nushi, Vidhisha Balachandran, Varun Chandrasekaran, Vibhav Vineet, Neel Joshi and Baharan Mirzasoleiman, MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation, arXiv.

[2] Siddharth Joshi, Yu Yang, Yihao Xue, Wenhan Yang and Baharan Mirzasoleiman, Towards Mitigating Spurious Correlations in the Wild: A Benchmark & New Datasets, arXiv.