YouTube is such a great multimodal dataset—videos, auto-generated captions, and ...

YouTube is such a great multimodal dataset—videos, auto-generated captions, and real engagement data all in one place. That’s a strong starting point for training, even before you filter for quality. Microsoft’s Phi-series models already show how focusing on smaller, high-quality datasets, like textbooks, can produce great results. You could totally imagine doing the same thing with YouTube by filtering for high-quality educational videos.

Down the line, I think models will start using video generation as part of how they “think.” Picture a version of GPT that works frame by frame—ask it to solve a geometry problem, and it generates a sequence of images to visualize the solution before responding. YouTube’s massive library of visual content could make something like that possible.