There is a popular story about AI datasets that goes like this: companies woke up one day, scraped the internet, trained models, and now we are stuck arguing about it.
That story is missing the interesting part.
The more honest story is that we have been building the machinery for “copying the web into computers” for decades, and we built it for reasons that made a lot of sense at the time: search, archiving, research, accessibility, and basic usability. Once that machinery existed, training data started to look less like a special new category of thing, and more like “yet another consumer” of the same pipelines.
This post is my attempt to connect the dots: from crawlers, to search engines, to Common Crawl, to the jump from text to images, and finally to why captioning and accessibility quietly mattered more than most people realize.
Crawling came first
When most people think “search engine”, they imagine a magical box that answers questions. Under the hood, a search engine starts with something much simpler: a program that downloads pages.
Google calls this crawling, done by automated programs like Googlebot [googleHowSearchWorks]. A crawler discovers pages by following links, fetches the contents, and stores what it found so the system can later answer queries quickly.
If you zoom out, crawling is a kind of industrialized reading. It is not reading for pleasure, it is reading for indexing. The point is not to enjoy a page, but to make the web searchable. And more specifically, searchable by words.
The crucial detail for our dataset story is that “indexing” does not just mean saving the raw webpage somewhere. Indexing means extracting structure and signals: the text you see, metadata, and also the little pieces of text attached to images, like the alt attribute [googleHowSearchWorks].
That is already the shape of a multimodal dataset: an image and some nearby text that might describe it.
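To make that concrete, here is a minimal sketch of the kind of extraction an indexer can do on a single page. It assumes the requests and beautifulsoup4 packages, and the URL is a placeholder of mine, not anything a real search engine runs.

```python
# Minimal sketch: the (image, nearby text) pairs an indexer can pull from one page.
# Assumes the requests and beautifulsoup4 packages; the URL is a placeholder.
import requests
from bs4 import BeautifulSoup

def image_text_pairs(url: str) -> list[tuple[str, str]]:
    """Return (image URL, alt text) pairs found on a single page."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for img in soup.find_all("img"):
        src = img.get("src")
        alt = (img.get("alt") or "").strip()
        if src and alt:  # keep only images that arrive with some describing text
            pairs.append((src, alt))
    return pairs

# e.g. image_text_pairs("https://example.com")
# -> [("/photos/cat.jpg", "A tabby cat asleep on a sofa"), ...]
```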
Common Crawl: the web, as a public dataset
Once you can crawl the web, the next obvious move is to store the crawl and share it.
That is what Common Crawl does. It is a nonprofit founded in 2007 that maintains a free, open repository of web crawl data, with hundreds of billions of pages spanning many years [commonCrawlHome].
This matters because it changed the default posture from “only big search engines can afford to crawl” to “a lot of people can build on top of the same raw material”. Researchers, startups, journalists, and eventually AI labs all found themselves downstream of the same web-scale archive.
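To give a feel for what “building on top of the same raw material” looks like, here is a minimal sketch of reading one Common Crawl archive file. It assumes the warcio package and a WARC segment you have already downloaded; the filename is a placeholder.

```python
# Minimal sketch: iterating over the fetched pages stored in one Common Crawl
# WARC file. Assumes the warcio package; the filename is a placeholder for a
# segment downloaded from Common Crawl.
from warcio.archiveiterator import ArchiveIterator

with open("CC-MAIN-example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue  # skip request/metadata records, keep the fetched pages
        url = record.rec_headers.get_header("WARC-Target-URI")
        html = record.content_stream().read()  # raw bytes of the page
        # ... hand `url` and `html` to whatever pipeline sits downstream
```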
If you want a single sentence version of the last decade of AI dataset history, it might be: once the web became a dataset, everyone started treating it like one.
We have fought about this before: Google Books as a preview
If this all feels new, it is worth remembering that “copying lots of text so people can search it” has already been tested in court.
In the Google Books project, Google worked with libraries to digitize books and make them searchable, showing small snippets in results [effGoogleBooks2015]. Authors sued, arguing that scanning entire books without permission was infringement.
In 2015, the Second Circuit ruled in Google’s favor, emphasizing that copying can be justified when it is transformative, such as enabling full-text search over millions of books, and that showing only snippets does not substitute for the originals [effGoogleBooks2015].
You do not need to agree with every implication of that case to appreciate why it mattered culturally. It taught the tech world a lesson: “copying for indexing and search” could be treated differently than “copying to republish”.
That framing shows up again and again in modern AI debates, even when the details differ.
Why text models scaled faster than image models
When large language models took off, a lot of people assumed the magic was the architecture. Transformers mattered, of course. But the training setup mattered just as much.
A next-word model trains on plain text by playing a simple game: given the words so far, predict the next one. The training labels are “free” because the internet already contains the answer: the next word in the document [brown2020gpt3].
That is what people mean when they say LLMs are trained with self-supervised learning. Nobody had to hire an army of experts to label each sentence with the correct next word. The data supervises itself.
If you imagine trying to build an LLM in a fully supervised way, it becomes absurd fast. It would look like asking humans to sit next to the model and say “yes, correct” or “no, wrong” for every token it generates, across trillions of tokens. It is not just expensive. It is basically a different industrial era.
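A toy sketch makes the “free labels” point concrete. The token IDs below are made up; the only thing that matters is that the targets are the inputs shifted by one position.

```python
# Why next-word prediction needs no annotators: the "label" at every position
# is just the token that already follows it in the document.
tokens = [464, 3797, 7363, 319, 262, 2603]   # made-up IDs for "The cat sat on the mat"

inputs  = tokens[:-1]   # what the model sees
targets = tokens[1:]    # what it must predict: the same sequence, shifted by one

for x, y in zip(inputs, targets):
    print(f"given ...{x}, predict {y}")
```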
Text had another advantage: the web is already mostly text. Even when a page is “about images”, the thing a crawler sees is mostly strings, links, and metadata. Text is simply the native format of the web.
The harder leap: from text to pictures
If you want an AI system that can generate images from text prompts, you need to teach it a relationship between language and pixels.
Early on, the clean way to do that was to build curated datasets where humans wrote captions for images. Datasets like MS COCO did exactly this, and they were hugely important for progress in image captioning and vision-language research [lin2014coco].
But there is a scaling problem hiding in plain sight: curated captions are expensive.
It is one thing to caption a few hundred thousand images with paid annotators. It is another thing entirely to caption hundreds of millions, or billions, with the kind of detail that would support open-ended image generation. The cost curve is brutal.
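For a sense of what is being paid for, a curated example looks roughly like this (field names simplified, in the spirit of MS COCO rather than its exact schema): every one of those captions was written by a human annotator.

```python
# Roughly the shape of one record in a curated captioning dataset.
# Field names are illustrative, not the exact MS COCO schema.
curated_example = {
    "image_file": "000000397133.jpg",
    "captions": [
        "A man riding a bicycle down a city street.",
        "A cyclist passing parked cars on a sunny afternoon.",
    ],  # typically several human-written captions per image
}
```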
This is where a lot of “text to image” history becomes less about model architectures and more about a simple question: where do you even get enough paired examples?
Diffusion models are self-supervised, but text-conditioning is not
To understand why web-scraped image datasets became such a big deal, it helps to separate two ideas that often get blurred together.
Right now, the strongest models for generating images are diffusion models.
Diffusion models, in their core training objective, are actually self-supervised. They learn to undo noise: you corrupt an image with noise and train the model to predict the missing information, step by step [ho2020ddpm]. That part does not require captions.
But the moment you want “generate an image that matches this text prompt”, you need the model to learn an alignment between text and images. For that, you need paired examples: an image and some text that goes with it. That is supervision, even if it is weak and messy.
Modern text-to-image systems like latent diffusion models made this scalable and practical, but they still rely on huge quantities of image-text pairs to connect language to visual generation [rombach2022ldm].
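To make the split between the two kinds of supervision concrete, here is a sketch of one training step, assuming PyTorch. The model signature and the conditioning argument are my own simplification, not any particular paper’s API: the denoising target costs nothing, but the text embedding only exists if you have an (image, text) pair.

```python
import torch
import torch.nn.functional as F

def training_step(model, image, text_embedding, alpha_bar):
    """One DDPM-style step: corrupt the image, train the model to predict the noise back."""
    t = torch.randint(0, len(alpha_bar), (image.shape[0],))    # random timestep per image
    noise = torch.randn_like(image)                            # the corruption is free
    a = alpha_bar[t].view(-1, 1, 1, 1)
    noisy = a.sqrt() * image + (1 - a).sqrt() * noise          # self-supervised part
    predicted = model(noisy, t, text_embedding)                # conditioning needs paired text
    return F.mse_loss(predicted, noise)

# Runs with a stand-in "model", e.g.:
# training_step(lambda x, t, c: x, torch.randn(4, 3, 64, 64),
#               torch.randn(4, 512), torch.linspace(0.99, 0.01, 1000))
```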
So the bottleneck was never just “how do we generate images”. The bottleneck was “how do we get enough captions” for text-to-image.
The underappreciated driver: accessibility and the legal risk of missing alt text
There is a less-known reason the modern web ended up with so much text attached to images, and it is not “because AI needed it”.
It is because blind and low-vision users needed it.
Accessibility standards have been telling web developers for a long time that images should have text alternatives. WCAG 2.0, published as a W3C Recommendation in 2008, states that non-text content should have a text alternative (Success Criterion 1.1.1) [wcag20].
In the United States, there are multiple legal forces that made accessibility real in practice. Section 508 was amended in 1998 to require federal agencies to make electronic and information technology accessible, and later updates aligned with WCAG 2.0 [section508laws]. The Department of Justice has also stated for many years that the ADA applies to web content, and it explicitly calls out missing alt text as a common barrier for blind users [dojAdaWebGuidance].
It is important to be precise here [1]. There was not a single new law that suddenly forced every website on earth to write perfect captions. What happened is messier: standards existed, enforcement and lawsuits created risk, and many organizations responded by taking accessibility more seriously.
And once you are in that mindset, “add alt text” becomes a routine part of publishing.
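This is also where tooling comes in. A minimal audit, of the kind accessibility checkers automate, is just a scan for images without a text alternative; the sketch below assumes beautifulsoup4 and uses a toy HTML snippet.

```python
from bs4 import BeautifulSoup

# Toy page: one described image, one missing alt, one explicitly decorative.
html = """
<img src="logo.png" alt="Acme Corporation logo">
<img src="chart.png">
<img src="divider.png" alt="">
"""

soup = BeautifulSoup(html, "html.parser")
for img in soup.find_all("img"):
    alt = img.get("alt")
    if alt is None:
        print(f"{img['src']}: no alt attribute at all, flagged under WCAG SC 1.1.1")
    elif alt.strip() == "":
        print(f"{img['src']}: empty alt, acceptable only if the image is decorative")
```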
Lawsuits made “missing captions” a business problem
One reason people started caring is that they got sued.
A landmark early website accessibility case involved Target. The complaint highlighted barriers including a lack of alternative text, and the class action settled in 2008 with Target agreeing to pay damages and to make its site accessible by a deadline [w3cTargetCase].
Even if you never followed the legal details, these cases changed the incentives for organizations. Accessibility stopped being a nice-to-have, and started being part of compliance and brand risk.
Once that happens, someone is tasked with going through a site and filling in all the missing “text for images”.
Another incentive: if it is not described in text, it is harder to find
Not every caption exists for accessibility. Some exist for reach.
Creators learned long ago that if you want your work to be found, you need to attach words to it. Captions, titles, tags, filenames, surrounding text, all of that helps search engines understand what a page is about.
This is where our story loops back to crawling. Crawlers are text-native. If you want a crawler to “see” an image, you have to give it words.
Sometimes that motivation is innocent, like “I want my photography portfolio to show up in search.” Sometimes it is compliance, like “our legal team told us missing alt text is risky.” Sometimes it is both.
By 2019, this pressure was widespread enough that even niche industries were publishing warnings about website accessibility lawsuits, including cases involving claims that sites did not work with screen readers [tractionAutoNation2019].
Different motivations, same artifact: images with attached text.
The moment this became training data
Once you have lots of images with lots of nearby text, you can start training models that learn alignment between language and visuals.
CLIP was one of the big turning points here. It showed that if you train on a large set of image-text pairs from the internet, you can learn a representation that connects language and images in a surprisingly general way [radford2021clip].
Shortly after, datasets like LAION-400M operationalized the idea that you can construct an enormous (image, text) dataset by filtering web-sourced pairs at scale, producing hundreds of millions of examples [schuhmann2021laion].
At that point, “text-to-image” stopped being limited by the number of paid captions you could afford, and started being limited by your ability to filter, deduplicate, and manage the chaos of the public web.
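As an illustration of what that filtering looks like, here is a sketch in the spirit of the LAION approach: score each (image, text) pair with CLIP and keep it only if the similarity clears a threshold. It assumes the transformers and Pillow packages; the model name and threshold are illustrative choices, not a claim about LAION’s exact pipeline.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_pair(image_path: str, caption: str, threshold: float = 0.28) -> bool:
    """Keep a web-scraped (image, text) pair only if CLIP thinks they match."""
    inputs = processor(text=[caption], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    sim = (img_emb / img_emb.norm()) @ (txt_emb / txt_emb.norm()).T
    return sim.item() > threshold
```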
That is why the ethical debates got louder. The scale changed the stakes.
Text-to-speech sits in the same pattern
When you move beyond text and into other modalities like images and audio, you usually need paired data.
A speech model does not just need audio. It needs audio aligned with the words that were spoken. Otherwise it cannot learn the mapping from language to sound in a controllable way.
In practice, this means that the “next-word prediction” trick that made LLMs explode in scale does not transfer cleanly to other modalities: the raw data contains the next audio sample or the next pixel, but it does not tell you which words go with them. You can do self-supervised learning in audio and vision too, but if you want controllable generation from language, you need alignment data.
That is the recurring theme: pairing is expensive, until the web accidentally produces it for you.
This is why the transition from text models to image generators felt like a sudden leap to the public. It was not just that diffusion models got better. It was that the pairing problem got partially solved by the shape of the modern web.
So how did we get here?
Search engines normalized crawling. Common Crawl lowered the barrier to getting web-scale text. Next-word prediction made plain text usable as training signal without labels. Accessibility and discoverability increased the amount of text attached to images. That created a messy but massive source of image-text pairs. Finally, diffusion-based systems used those pairs to make text-to-image generation feel mainstream.
That is how we got here.
What’s next
In the next post, I want to talk about the ethical and legal questions more directly.
Is training on text for next-word prediction morally similar to training on art for image generation? Does “transformative use” apply in the same way, or does it break when the output competes with the original creator’s market? How should we think about consent, attribution, and compensation in a world where the web itself is the dataset?
Until my next post, let’s make intelligence less artificial.