Measuring Training ROI Without the Consultants: Kirkpatrick for People With Day Jobs

Somebody above you has asked what the training budget is actually buying, and the honest answer right now is "completion rates and a smile-sheet average." You know that's not an answer. They know it's not an answer. And the industry's response — a consulting engagement and an ROI methodology with decimal points — is a worse answer wearing a better suit.

I've owned training and QA for organizations where the stakes were real: 200+ contact-center agents across 20+ pharmaceutical brands, provider networks in the six figures, emergency-operations software where "did the training work" gets tested by an actual emergency. Here's the measurement system that survives contact with a day job.

The model everyone cites and nobody finishes

Kirkpatrick's four levels, in plain language. Level 1 — Reaction: did they like it? Level 2 — Learning: can they pass a check on it? Level 3 — Behavior: are they doing it differently on the job a month later? Level 4 — Results: did a business number move?

The industry's dirty secret is the drop-off: nearly everyone measures L1 (the smile sheet is automatic), most measure some L2 (the quiz at the end), and then it falls off a cliff — only a small fraction ever measure L3, and L4 is mostly conference-talk material. Which means most training functions are measuring the two levels that don't predict anything. People rate entertaining trainings highly and forget them by Thursday; people pass quizzes on Friday and revert to old habits Monday. The value lives at L3 and L4 — and the good news is that measuring them is less work than the consultants need you to believe.

Start before you build: the behavior sentence

The single highest-leverage measurement act happens before any content exists. Finish this sentence: "Thirty days after this training, [role] will [observable behavior] instead of [current behavior]."

"Reps will quote the new pricing from the configurator instead of the old rate card." "Managers will document performance conversations within 48 hours instead of never." If you can't finish the sentence, you don't have a training need; you have a vague anxiety, and no amount of content cures a vague anxiety. If you can finish it, you've just defined your L3 measurement, your content scope, and your pass/fail bar in one sentence. (This is the front end of backward design; the full Stage 1 playbook covers the rest.)

The minimum viable measurement stack

L1, demoted to two questions. Keep the smile sheet but stop pretending it's evaluation. Two questions only: "How confident are you that you can do [the behavior] now?" and "What's one thing that would have made this more useful?" The first is a leading indicator with actual signal; the second is your improvement backlog. Star ratings of the instructor measure charisma, and you weren't buying charisma.

L2, but performance, not recall. A quiz asking them to recognize the right answer measures short-term memory. Have them do the thing once — handle the mock ticket, run the scenario, produce the document — and you're measuring something that transfers. One performance check beats twenty multiple-choice items, and with AI generating scenario variations, building it is no longer the bottleneck it was.

L3, the 30-day pulse — this is the whole ballgame. Thirty days out, two short asks. To the learner: "Have you done [the behavior] since the training? What got in the way?" To their manager: "Have you observed [the behavior]? More, same, or less than before?" That's it — two emails, three questions, five minutes of someone's time. The response data will be imperfect and it will still be infinitely more decision-useful than completion rates, because "what got in the way" answers are where you find the real blockers: the tool that doesn't match the training, the incentive pointing the other way, the manager who tells them to skip it under pressure. (We automated exactly this pulse into LearningByDesign because the discipline, not the difficulty, is why nobody does it — but two calendar reminders and a form get you the same data.)
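If you'd rather script the reminder and the tally than trust a calendar, the pulse reduces to one date and one count. This is a minimal Python sketch, not how LearningByDesign does it: the function names and the idea of coding manager answers as "more", "same", or "less" strings are assumptions made for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def pulse_date(training_day: date, days_out: int = 30) -> date:
    """Date to send the two follow-up emails (30 days out by default)."""
    return training_day + timedelta(days=days_out)


def tally_manager_answers(answers: list[str]) -> dict:
    """Count 'more' / 'same' / 'less' answers from managers.

    Anything else (blank, free text) is ignored rather than guessed at.
    'share_more' is the fraction of valid answers that said 'more'.
    """
    counts = Counter(a.strip().lower() for a in answers)
    result = {key: counts.get(key, 0) for key in ("more", "same", "less")}
    total = sum(result.values())
    result["share_more"] = result["more"] / total if total else 0.0
    return result
```

The free-text "what got in the way" answers are deliberately left out of the tally. Those are for reading, not counting.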

L4, one metric per program, claimed honestly. Pick the one business number the behavior sentence should move — handle time, error rate, time-to-productivity for new hires, audit findings — and look at it 60–90 days out. Then make the honest claim, which is contribution, not causation: "The error rate fell 30% in the quarter we trained and changed the checklist." Anyone selling you a clean causal ROI percentage on a normal training budget is selling decimal-point theater. Executives, in my experience, trust the honest version more — it sounds like someone who knows what they're looking at.
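The arithmetic behind a contribution claim is just a before/after percent change, stated alongside everything else that changed in the same window. A small Python sketch under that assumption; the function names and the sentence template are hypothetical, and the numbers in the usage note are illustrative rather than real results.

```python
def percent_change(before: float, after: float) -> float:
    """Percent change from before to after; negative means the metric fell."""
    if before == 0:
        raise ValueError("before must be non-zero to compute a percent change")
    return (after - before) / before * 100


def contribution_claim(metric: str, before: float, after: float,
                       other_changes: list[str]) -> str:
    """Build a contribution (not causation) statement.

    other_changes lists what else happened in the same period, so the
    claim never credits the training alone.
    """
    change = percent_change(before, after)
    if change < 0:
        movement = f"fell {abs(change):.0f}%"
    elif change > 0:
        movement = f"rose {change:.0f}%"
    else:
        movement = "held flat"
    actions = " and ".join(["trained"] + list(other_changes))
    return f"The {metric} {movement} in the period we {actions}."
```

For example, `contribution_claim("error rate", 10.0, 7.0, ["changed the checklist"])` produces "The error rate fell 30% in the period we trained and changed the checklist." The training sits in the sentence next to the checklist change and does not get sole credit.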

What to stop measuring

Completion rates (attendance, not effectiveness — report it, never headline it). Hours of training delivered (a cost masquerading as an output). Average stars per course (charisma). Number of courses in the library (the graveyard playbook applies to course catalogs too — a library nobody applies is shelf-ware with an LMS license).

The one-page scorecard

Per program, four lines: the behavior sentence; % of learners who passed the performance check (L2); % doing the behavior at 30 days, per the pulse (L3); the one business metric, before and after, with the honest contribution claim (L4). That page is your budget defense, your renewal argument, and your kill-list generator — programs that fail L3 two cycles running get redesigned or retired, which is itself a result worth money.
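The four lines and the kill-list rule can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the field names, the choice to use pulse respondents as the L3 denominator, and especially the `threshold` that defines "failing" L3, which the scorecard leaves for you to set per program.

```python
from dataclasses import dataclass


@dataclass
class ProgramScorecard:
    behavior_sentence: str
    l2_passed: int          # learners who passed the performance check
    l2_attempted: int       # learners who took it
    l3_doing_behavior: int  # pulse respondents doing the behavior at 30 days
    l3_responded: int       # pulse respondents (assumed denominator)
    metric_name: str
    metric_before: float
    metric_after: float

    @staticmethod
    def _pct(part: int, whole: int) -> float:
        return 100 * part / whole if whole else 0.0

    def l2_pass_rate(self) -> float:
        return self._pct(self.l2_passed, self.l2_attempted)

    def l3_rate(self) -> float:
        return self._pct(self.l3_doing_behavior, self.l3_responded)

    def render(self) -> str:
        """The one-page scorecard: four lines per program."""
        return "\n".join([
            f"Behavior: {self.behavior_sentence}",
            f"L2 passed performance check: {self.l2_pass_rate():.0f}%",
            f"L3 doing the behavior at 30 days: {self.l3_rate():.0f}%",
            f"L4 {self.metric_name}: {self.metric_before} before, "
            f"{self.metric_after} after (contribution, not causation)",
        ])


def should_redesign_or_retire(l3_rates: list[float], threshold: float) -> bool:
    """True when the two most recent cycles are both below the threshold."""
    return len(l3_rates) >= 2 and all(r < threshold for r in l3_rates[-2:])
```

Pass `l3_rates` oldest cycle first. A single bad cycle never triggers the flag, which matches the "two cycles running" rule.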

Closing

You don't need an ROI consultancy. You need the behavior sentence before you build, a performance check instead of a recall quiz, the 30-day pulse — two emails, three questions — and one honest business metric per program. Run that stack and you'll be measuring more than the majority of enterprise L&D departments, on a budget of approximately zero.

Training that can't name the behavior it changes isn't training. It's content.

— Tom

About the author

Tom Christian is the founder of LearningByDesign, an AI-native learning platform that builds real training — needs analysis to course to evaluation — without hiring a Director of L&D.

He has spent twenty years inside training, learning, and quality at scale — building and running programs at Guardian Life, ConnectiveRx, and Horizon Blue Cross Blue Shield. He writes about course design that changes behavior, the discipline of starting with outcomes, and measuring whether learning actually happened.