About Me

Hi, I’m Sid. I’m a Member of Technical Staff at DatologyAI, where I lead data curation for Vision-Language Models (VLMs), and a fourth-year PhD candidate in Computer Science at UCLA, advised by Professor Baharan Mirzasoleiman. My research focuses on data curation for efficient and robust learning, i.e., selecting or generating the best subsets of data for training to reduce costs without sacrificing accuracy. I aim to develop practically effective and theoretically rigorous approaches to these problems.

Open Office Hours: To pay forward all the help I’ve received so far in pursuing a career in ML research, I dedicate 1-2 hours each week to open office hours. This is best suited for relatively junior students (undergraduate/masters), since I’m not very experienced myself :). If you’d like to chat about research, grad school, or anything else, please fill out this form.

In my free time, I like to write (https://medium.com/@sjoshi804), read about philosophy and run.

Highlights

  • Foundations of Data-efficient Machine Learning Tutorial @ ICML ‘24: (Slides, Video) Gave a 2-hour tutorial at ICML ‘24 on principled approaches to data curation/pruning for efficient learning!
  • DatBench: DatBench is a cleaned evaluation suite for VLMs that satisfies three desiderata: faithfulness, discriminability, and efficiency. By converting multiple-choice to generative tasks and filtering blindly solvable/mislabeled samples, DatBench achieves 13x average speedup (up to 50x) while closely matching the discriminative power of original datasets across 33 datasets spanning nine VLM capabilities.
  • MM-GEN: MM-GEN is the first method to enable fully automated data curation for improving VLM performance on downstream tasks, requiring as few as 50 reference examples from the task. Published in DMLR 2025. Code available on GitHub!
  • CLIPCov: CLIPCov selects subsets of pre-training data to enable data-efficient contrastive language-image pre-training (CLIP) (AISTATS ‘24). Speed up your CLIP training by over 50% with theoretically grounded data efficiency!
  • SAS: SAS selects subsets of pre-training data to enable data-efficient contrastive SSL (ICML ‘23). Give it a spin to try out data-efficient SSL and speed up SSL training by over 40%!
  • SpuCo: SpuCo is a Python package developed to make research on addressing spurious correlations effortless. Check it out!

Publications

[1] Siddharth Joshi, Besmira Nushi, Vidhisha Balachandran, Varun Chandrasekaran, Vibhav Vineet, Neel Joshi, et al., MM-GEN: Principled and Generalizable Data Curation for Enhancing Task Performance in VLMs, Journal of Data-centric Machine Learning Research (DMLR) 2025.

[2] Siddharth Joshi, Jiayi Ni and Baharan Mirzasoleiman, Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks, ICLR 2025.

[3] Siddharth Joshi, Arnav Jain, Ali Payani and Baharan Mirzasoleiman, Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity, AISTATS 2024.

[4] Yihao Xue, Siddharth Joshi, Dang Nguyen and Baharan Mirzasoleiman, Understanding the Robustness of Multi-modal Contrastive Learning to Distribution Shift, ICLR 2024.

[5] Yihao Xue, Eric Gan, Jiayi Ni, Siddharth Joshi and Baharan Mirzasoleiman, Investigating the Benefits of Projection Head for Representation Learning, ICLR 2024.

[6] Siddharth Joshi and Baharan Mirzasoleiman, Data-Efficient Contrastive Self-supervised Learning: Most Beneficial Examples for Supervised Learning Contribute the Least, ICML 2023.

[7] Yihao Xue, Siddharth Joshi, Eric Gan, Pin-Yu Chen and Baharan Mirzasoleiman, Which Features are Learnt by Contrastive Learning? On the Role of Simplicity Bias in Class Collapse and Feature Suppression, ICML 2023 (Oral).

[8] Siddharth Joshi*, Yuhan Liu* and Baharan Mirzasoleiman, Low Rank Pruning via Output Perturbation, Sparsity in Neural Networks Workshop 2022.

* = equal contribution

Preprints

[1] Siddharth Joshi, Hao Yin, Raghav Adiga, Riccardo Monti, Andres Carranza, Alyssa Fang, Aidan Deng, Amro Abbas, et al., DatBench: Discriminative, Faithful, and Efficient VLM Evaluations, arXiv 2026.

[2] Andres G. Carranza, Kyle Mentzer, Riccardo P. Monti, Alyssa Fang, … Siddharth Joshi, et al., ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset, arXiv 2026.

[3] Pratyush Maini, Vishaal Dorna, Parth Doshi, Andres Carranza, … Siddharth Joshi, et al., BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining, arXiv 2025.

[4] Lee Merrick, Alyssa Fang, Andres Carranza, Aidan Deng, Amro Abbas, Ben Larsen, Chad Blakeney, … Siddharth Joshi, et al., Luxical: High-Speed Lexical-Dense Text Embeddings, arXiv 2025.

[5] Siddharth Joshi, Yu Yang, Yihao Xue, Wenhan Yang and Baharan Mirzasoleiman, Towards Mitigating Spurious Correlations in the Wild: A Benchmark & New Datasets, arXiv.