less artificial

Generalization vs. Memorization: What It Means to Learn

Why good tests look different from homework, how Sherlock Holmes fits in, and what confusion matrices have to do with real learning.

My mom teaches third grade. Because she’s a great teacher, she gives homework.

On that homework, kids do a bunch of addition and subtraction problems. They try it, they get feedback (from my mom, or from the answer key, or from a grade), and then they take a test later.

Here’s the key detail: on the test, the kids see similar problems that test the same ideas… but they don’t see the exact same problems.

That one design choice forces something deep: the kids must generalize, not just memorize.

And if you want one mental model for understanding both human learning and machine learning, it’s this:

Learning means generalizing in new situations—not just repeating what you’ve already seen.

Homework that teaches vs. homework that trains you to memorize

We’ve all had the opposite kind of class too: the teacher hands out a multiple-choice “study guide,” and then the same questions show up on the test. You can ace the exam by memorizing the pattern rather than understanding the concept.

That might raise your grade, but it doesn’t necessarily raise your capability.

Researchers in education and psychology often distinguish remembering a fact from being able to transfer an idea to a new situation (what we’d casually call “generalizing”). In one set of experiments, Butler showed that repeated testing didn’t just help students answer the same questions later—it improved performance on new inferential questions, including questions that required applying the ideas in different contexts [butler2010transfer] .

Crucially, the test format changes what students practice. Scouller found that students reported more surface / memorization-oriented approaches when preparing for multiple-choice exams, but more deep-learning approaches when preparing essay-style assignments—and they also perceived multiple choice questions as targeting lower-level cognition more often than essays [scouller1998assessment] . In a large introductory biology course, Stanger-Hall similarly found that a multiple-choice-only exam format was associated with less “cognitively active” studying, while adding constructed-response questions was associated with better performance on a cumulative final, especially on higher-level questions [stangerhall2012mc] .

This maps almost perfectly onto the homework-vs-test idea: good homework is “practice plus feedback,” but good tests ask you to reconstruct the concept in a slightly new form [1] .

Multiple choice can help—but it can also create “false knowing”

None of this is an argument that multiple-choice questions are always bad. But they can create a trap: when you repeatedly show plausible wrong answers, students can later recall the wrong option as if it were true.

In one study, multiple-choice tests improved later recall of the right facts—but they also increased the chance that students would later reproduce the multiple-choice lures (the wrong options) as if they were correct answers [roediger2005] .

That’s a nice reminder that memorization isn’t just “not learning.” Sometimes it’s learning the wrong thing—because the evaluation rewarded the wrong behavior.

Sherlock Holmes and two kinds of “reasoning”

Sherlock Holmes is usually described as doing “deduction,” and that’s a useful contrast for thinking about generalization.

  • Deduction is applying a rule to a case.
    • Rule: If someone is in a rush, they might not tie their tie.
    • Observation: This person didn’t tie their tie.
    • Conclusion: Maybe they were in a rush.

That’s the vibe of Holmes: he seems to have a library of “if-then” rules and fires them like a logical machine. (Strictly, running a rule backward from clue to cause is what logicians call abduction, but the contrast that matters here is with induction.)

But most learning—in humans and in modern AI—looks more like:

  • Induction is inferring a rule from many cases.
    • You see many rushed people who didn’t tie their tie.
    • You (tentatively) generalize: rushing → sloppy tie.

Induction is what makes learning feel powerful: it’s how we go from “I’ve seen five examples” to “I can handle the sixth.”

However, induction has a permanent enemy.

The enemy of induction: correlation vs. causation

When you generalize from examples, you can accidentally generalize the wrong thing—because the world contains correlations that are real, strong, and totally non-causal.

A clean way to say it is: confounding variables can make two things look related even when neither causes the other. OpenIntro gives a classic example: ice cream sales and boating accidents both rise in summer—because temperature influences both. That doesn’t mean ice cream causes accidents [openintroStudyDesign] .

Spurious correlations (the fun version)

Tyler Vigen’s Spurious Correlations site is the famous “pirates vs something” style of example. It’s funny, but the underlying lesson is deadly serious: if you search hard enough, you can find extremely strong correlations between unrelated things [vigenSpurious] .

For example, pirate attacks globally correlate with gasoline pumped in Switzerland with a reported Pearson correlation of about 0.93 on that page [vigenPirateGas] . Unless pirates are secretly managing Swiss fuel logistics, this is almost certainly not causal.
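It’s also easy to reproduce the effect yourself. Here’s a minimal sketch (my own toy setup, not data from Vigen’s site): generate a few hundred unrelated random-walk series, then search every pair for the strongest correlation.

    import numpy as np

    rng = np.random.default_rng(0)

    # 200 unrelated "time series": pure random walks over 20 yearly observations.
    n_series, n_years = 200, 20
    series = rng.normal(size=(n_series, n_years)).cumsum(axis=1)

    # Search every pair for the strongest correlation.
    best_r, best_pair = 0.0, None
    for i in range(n_series):
        for j in range(i + 1, n_series):
            r = np.corrcoef(series[i], series[j])[0, 1]
            if abs(r) > abs(best_r):
                best_r, best_pair = r, (i, j)

    print(f"strongest correlation among unrelated series: {best_r:.2f} (pair {best_pair})")

With roughly 20,000 pairs to choose from, a correlation above 0.9 isn’t a discovery; it’s what searching that hard practically guarantees.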

And that brings us to machine learning.

What we ask machine learning systems to do (and why it’s hard)

Machine learning is basically industrial-scale induction:

  • You show the system examples (inputs + desired outputs).
  • It adjusts itself based on feedback (errors).
  • You hope it learns a pattern that will keep working on new examples.
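In code, that loop can be as small as this. A deliberately tiny sketch in plain NumPy (the data, the learning rate, and the number of steps are all made up): the “model” is two numbers, the feedback is the prediction error, and the hope is checked on inputs it never trained on.

    import numpy as np

    rng = np.random.default_rng(0)

    # Examples: inputs x and desired outputs y (the hidden pattern is y = 3x + 1, plus noise).
    x = rng.uniform(-1, 1, size=100)
    y = 3 * x + 1 + rng.normal(scale=0.1, size=100)

    w, b = 0.0, 0.0                         # the "model": two adjustable numbers

    for step in range(500):
        pred = w * x + b                    # show the system the examples
        error = pred - y                    # feedback: how wrong was it?
        w -= 0.1 * (error * x).mean()       # adjust to reduce the error
        b -= 0.1 * error.mean()

    # New inputs it never saw: the hope is that the learned pattern still holds.
    x_new = rng.uniform(-1, 1, size=20)
    print("mean squared error on new inputs:", np.mean((w * x_new + b - (3 * x_new + 1)) ** 2))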

But “pattern” is ambiguous. A model can latch onto:

  • a real causal signal (e.g., “tumors often have this texture”),
  • or a spurious shortcut (e.g., “these images came from this hospital, and that hospital sees more cancer”).

This idea is now often called shortcut learning: models find decision rules that score well on standard tests but fail when the situation changes in a meaningful way [geirhos2020] .

Humans do this too, by the way. If the test always repeats the homework questions, we “learn” the shortcut: memorize the answer key.

A real example: medical imaging models that learn the hospital, not the disease

Here’s a concrete version of the “hospital scan example” from the previous section.

Zech and colleagues trained deep learning systems to detect pneumonia from chest X-rays and found a painful truth: models that performed great on “internal” test data (from the same hospital systems used during development) could drop substantially on “external” data (from a different hospital system). One reason: the model could often identify the hospital system itself from the image, and hospital identity correlated with disease prevalence and other factors [zech2018] .

In other words: the model didn’t just learn “pneumonia.” It learned a messy bundle of signals, some medical and some institutional—and the institutional ones were dangerously predictive.

This is exactly correlation-vs-causation, but operationalized:

  • Correlation: “Hospital A images tend to have more pneumonia labels.”
  • Non-causal shortcut: “Detect Hospital A.”
  • Failure mode: deploy somewhere else; the shortcut stops working.
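Here’s a toy version of that failure mode (entirely synthetic data, not the Zech study): the classifier gets one weak real signal plus a “site” feature that correlates with the label in the development data, and the correlation then breaks at the new hospital system.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def make_scans(n, shortcut):
        """Toy 'chest X-ray' features: a weak real signal plus an optional site shortcut."""
        disease = rng.integers(0, 2, size=n)
        texture = disease + rng.normal(scale=1.5, size=n)      # weak causal signal
        if shortcut:
            # In the development hospitals, site 1 happens to see far more disease.
            site = np.where(disease == 1,
                            rng.choice([0, 1], size=n, p=[0.1, 0.9]),
                            rng.choice([0, 1], size=n, p=[0.9, 0.1]))
        else:
            site = rng.integers(0, 2, size=n)                  # new hospital: correlation gone
        return np.column_stack([texture, site]), disease

    X_train, y_train = make_scans(4000, shortcut=True)
    X_internal, y_internal = make_scans(2000, shortcut=True)   # same hospital system
    X_external, y_external = make_scans(2000, shortcut=False)  # different hospital system

    model = LogisticRegression().fit(X_train, y_train)
    print("internal test accuracy:", round(model.score(X_internal, y_internal), 3))
    print("external test accuracy:", round(model.score(X_external, y_external), 3))

The exact numbers will vary with the noise, but the gap between the two accuracies comes from the model leaning on the “site” feature instead of the real signal.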

How we test for generalization (instead of memorization)

At a high level, the fix is simple:

  1. Train on one set of data.
  2. Evaluate on data the model didn’t see.

But there’s nuance in how we do that well.

Training, validation, test

A common split is:

  • Training set: what the model learns from (like homework).
  • Validation set: what we use to tune choices (like practice quizzes).
  • Test set: the final exam (ideally touched once).

If you keep “peeking” at the test set and making changes, you can accidentally overfit to the test too—like a teacher who keeps rewriting the curriculum until last year’s test average looks good.
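Mechanically, the split is the easy part. A sketch with scikit-learn (the synthetic data and the 15% fractions are arbitrary choices): carve off the final test set first, then split what remains into training and validation.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Stand-in data; in a real project X and y come from your actual problem.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Final exam first (set aside and ideally touched once)...
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
    # ...then homework vs. practice quizzes.
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.15, random_state=0)

    print(len(X_train), "train /", len(X_val), "validation /", len(X_test), "test examples")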

One of the nice things about training modern AI systems is that we can watch the learning process as it happens.

During training, we repeatedly measure the model’s error [2] on:

  • the training set (the “homework problems” it’s allowed to learn from), and
  • a validation set (fresh problems it hasn’t trained on, used as a reality check).

If the model is learning general patterns, both curves usually improve together: training error drops, validation error drops too (often with a small gap, because the training set is easier—it’s what we optimized for).

If the model is starting to memorize quirks of the training set, training error can keep getting better while validation error stops improving—or even gets worse. That’s the classic signature of overfitting.

[Chart: Healthy training (generalization). Both training and validation loss drop together, with a small gap. This is what we mean by generalization in practice.]

[Chart: Overfitting (memorization). Training loss keeps improving, but validation loss starts getting worse. The model is fitting training-set quirks that don’t transfer.]

In real projects, this is why techniques like early stopping [3] are so common: you’re literally stopping at the point where the model is most “third-grade-test ready,” before it starts cramming the answer key.
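A minimal version of early stopping looks like this. A sketch using a recent scikit-learn model that supports incremental training (the patience value, the tolerance, and the synthetic data are placeholders; real frameworks ship this as a built-in callback):

    import copy

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
    model = SGDClassifier(loss="log_loss", random_state=0)

    best_val, best_model, patience, bad_epochs = float("inf"), None, 5, 0
    for epoch in range(200):
        model.partial_fit(X_train, y_train, classes=np.unique(y))  # one pass over the "homework"
        val = log_loss(y_val, model.predict_proba(X_val))          # reality check on fresh problems
        if val < best_val - 1e-4:                                  # still generalizing: keep going
            best_val, best_model, bad_epochs = val, copy.deepcopy(model), 0
        else:                                                      # validation stopped improving
            bad_epochs += 1
            if bad_epochs >= patience:
                break

    model = best_model  # keep the snapshot that was most "test ready"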

Confusion matrices: the microscope for mistakes

While monitoring the training and validation loss is extremely important, we also want to understand how well the model does when forced to make hard decisions.

For a binary classifier (say, “pneumonia” vs “no pneumonia”), we can summarize performance with a confusion matrix:

A binary confusion matrix (structure)

                        Predicted Positive      Predicted Negative
  Actual Positive       TP (true positive)      FN (false negative)
  Actual Negative       FP (false positive)     TN (true negative)

From that, we can define common metrics:

\text{Precision} = \frac{TP}{TP + FP} \qquad (1)

\text{Recall} = \frac{TP}{TP + FN} \qquad (2)

Precision and recall are often in tension. If you want fewer false alarms, you push precision up. If you want to miss fewer real cases, you push recall up. The “right” tradeoff depends on the real-world stakes—especially in medicine, fraud, safety, and security [4] .

If you want a single number that balances both precision and recall, you’ll see F1, ROC AUC, and friends in essentially every ML paper and library [sklearnMetrics] [5] .
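As a concrete sketch, here’s how those numbers come out of scikit-learn for a handful of made-up predictions from a “pneumonia vs. no pneumonia” classifier:

    from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

    # Made-up ground truth (1 = pneumonia) and predictions.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"TP={tp} FP={fp} FN={fn} TN={tn}")

    print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) = 3 / 5
    print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN) = 3 / 4
    print("F1:       ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall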

So where do large language models fit?

Modern AI systems—especially large language models (LLMs)—are trained on massive datasets and can exhibit both:

  • impressive generalization (writing code in a new style, explaining a new topic),
  • and occasional memorization (repeating specific training snippets).

But a key point: LLMs exhibit a kind of generalization that is different from the generalization we measure in traditional machine learning models. That is a big part of why LLMs have exploded in popularity, and also why they can be dangerous if used carelessly.

In classical machine learning, we typically keep a clear separation between training and evaluation data, but both come from the same underlying task (e.g. image classification, speech recognition, etc.). LLMs, by contrast, are initially trained on a massive dataset of text to predict the next word in a sequence. They are tested on held-out text from that same prediction task, but notably, next-word prediction is NOT the task we use LLMs for in practice.

LLMs are used nowadays for a wide range of tasks, including writing code, explaining complex topics, and even generating images. While major LLM providers have built robust testing and evaluation frameworks, it becomes much harder to know whether the model is actually generalizing or just memorizing, because the model was never explicitly trained to “do your taxes” or “write a novel.” These are what we call “emergent” behaviors: capabilities that were never explicitly taught, but that show up as the models and their training data scale [emergent2022] .

One striking result: researchers have demonstrated training data extraction attacks, where an attacker can query a language model and recover verbatim sequences from its training set—even sequences that appeared only once in the training data [carlini2021] .

That doesn’t mean LLMs are “just memorizing.” It means the boundary between “memorized” and “generalized” is not clean—and you shouldn’t assume that a good chatbot implies robust understanding.

The throughline

The reason I like the “my mom’s third grade homework” story is that it’s the simplest version of a profound principle:

  • If the test repeats the homework, you reward memorization.
  • If the test changes the surface but keeps the concept, you reward generalization.
  • If the world changes in a deeper way (new hospital, new scanner, new user population), you find out whether you learned the thing you meant to learn.

Whether you’re teaching kids or training neural networks, the core job is the same: design feedback loops that reward the right behavior.

Practical takeaway questions

When you see a model (or a student) doing well, ask:

  1. What counts as “new” here? Is the test genuinely different from the practice?
  2. Could there be a shortcut? What correlations exist that might let you cheat?
  3. What would change in the real world? New school, new hospital, new users, new policy, new camera, new language.
  4. What mistakes matter most? Look at the confusion matrix, not just one headline metric.

That’s what it means to learn: not “I can repeat it,” but “I can use it when it matters.”

Until next time, let’s make intelligence less artificial.