Member of Technical Staff · DatologyAI

Better models through better data

I’m Sid, and I lead multimodal data curation at DatologyAI — choosing, filtering, and synthesizing the data that models train on. My thesis is that curating the data is the highest-leverage way to improve a model. My work spans the research behind those curation methods, the production-grade training infrastructure that runs them, and evaluation research — because good evals are upstream of all good ML research.

Siddharth Joshi
Open office hours: I set aside a couple of hours a week for mentorship on research, careers, or whatever's on your mind. Grab a slot →
// What I build at DatologyAI

Leading multimodal data curation

Over the past year, our team built a VLM training stack from scratch, rethought how VLMs are evaluated, and ran a lot of exhilarating new research on data curation. A few highlights.

Flagship 2026 · Vision-Language Models

20/20 Vision Language Models

A Prescription for Better VLMs through Data Curation Alone

DatologyAI · Joshi et al.

Hold the architecture, recipe, and compute fixed — vary only the pretraining data. Our pipeline (multimodal deduplication, quality filtering, mixture design, and both task-agnostic and task-specific synthetic data, all with rigorous multimodal decontamination) produces much better vision-language models — reproducibly.

+11.7pp across 20 public VLM benchmarks @ 2B
~17× less train compute vs. InternVL3.5-2B (~10pp better)
3.3× lower response FLOPs vs. Qwen3-VL-4B, near-frontier @ 4B
+57.1pp grounding on RefCOCO
−67% cross-seed variance (2.47 → 0.82pp)
Evals 2026 · arXiv

DatBench

Discriminative, Faithful & Efficient VLM Evaluations

DatologyAI · Joshi et al.

VLM benchmarks are broken: multiple-choice formats inflate scores, many questions are solvable without the image, and full suites are ruinously expensive to run. DatBench fixes this by converting multiple-choice questions to generative ones, filtering out blind-solvable samples (up to 72%) and mislabeled samples, and keeping only high-discrimination items.

13× avg speedup (up to 50×)
33 datasets · 9 capabilities
40% of the data, same signal
Infra · Internal

The VLM training stack

From scratch, with the multimodal team

None of the curation results matter without a training and evaluation stack you can trust. We built and optimized a full VLM pretraining pipeline from scratch — so controlled, data-only experiments are reproducible at scale.

1B→4B consistent gains across scale
data-only controlled experimentation
// Also contributed to

Scaling data curation beyond images

2025 · arXiv

BeyondWeb

Lessons from scaling synthetic data for trillion-scale pretraining.

2026 · arXiv

ÜberWeb

Multilingual curation for a 20-trillion-token dataset.

2025 · arXiv

Luxical

High-speed lexical-dense text embeddings for retrieval and curation.

// Where the ideas come from

Research roots

Before Datology, I spent my PhD at UCLA working on the theory and practice of data-efficient learning — selecting or generating the best subsets of data to train on. It’s a straight line to the work I do now.

ICML '24

Foundations of Data-Efficient Learning

A 2-hour tutorial on principled data curation & pruning. Tutorial · Video

DMLR '25

MM-GEN

First fully automated VLM data curation method. It improves downstream task performance starting from as few as 50 reference examples. Paper · Code

AISTATS '24

CLIPCov

Data-efficient CLIP pretraining — speed up training 50%+ with theoretically-grounded subset selection. Project

ICML '23

SAS

Subset selection for data-efficient contrastive self-supervised learning — 40%+ faster SSL. Code

Toolkit

SpuCo

A Python package that makes research on spurious correlations effortless. Docs

ICML '23

Class Collapse & Feature Suppression

What contrastive learning actually learns — the role of simplicity bias. Oral, top 2%. Project

// Why I do this

About

Five years ago I left a comfortable software job in Big Tech to start a PhD. Last year I left the PhD to join Datology. Both were the same bet: I wanted to do research in the truest sense of the word.

Not chasing deadlines — sitting with messy, unsolved, sometimes previously unheard-of problems and slowly making sense of them. Deep learning hooked me precisely because it blends empiricism, mathematics, and real-world impact all at once, and I’ve chased that feeling across every risk I’ve taken since. I care about research that’s both rigorous and genuinely useful — and, just as much, about growing the people I get to work with along the way.

Take risks. Bet on yourself.

When I’m not training models, you’ll find me reading philosophy, out on a run, or writing: long-form on Medium and, more recently, on Substack.

// Papers

Publications & preprints

// Recent

News

  • 20/20 Vision Language Models — the culmination of our multimodal team’s year of work — is out. arXiv · blog

  • DatBench: Discriminative, Faithful, and Efficient VLM Evaluations preprint on arXiv.

  • ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset preprint on arXiv.

  • Grew into leading multimodal data curation at DatologyAI as Member of Technical Staff.

  • Luxical: High-Speed Lexical-Dense Text Embeddings preprint on arXiv.

  • BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining preprint on arXiv.

  • MM-GEN published in the Journal of Data-centric ML Research (DMLR) 2025.

  • Joined DatologyAI as a Research Scientist Intern, then stayed on as MTS.

  • Dataset Distillation via Knowledge Distillation accepted to ICLR '25.

  • Gave the Foundations of Data-Efficient Learning tutorial at ICML '24.

  • Interned at Microsoft Research (AI Frontiers Team).

  • Three papers accepted across AISTATS '24 and ICLR '24; advanced to PhD candidacy.

  • Two papers at ICML '23 (one Oral, top 2%).