Your AI Metrics Are Measuring the Wrong Thing
A research-backed framework for measuring sophistication, not just activity.
Most organizations measure AI adoption the way they measure gym memberships. How many people sign up. How often they swipe in. Maybe how long they stay. None of which tells you whether anyone is actually getting stronger.
The AI equivalent: prompt counts, hours logged, tokens consumed, self-assessed skill levels. These numbers are easy to collect, and at most companies they look encouraging. Adoption is up. Usage is growing. The dashboards are green.
But adoption is not sophistication. And the gap between the two is enormous.
What the Research Actually Shows
KPMG and researchers at the University of Texas at Austin spent eight months studying this gap. They analyzed 1.4 million AI prompts from roughly 2,500 professionals — not surveys, not self-reports, but actual conversation logs at scale. The question was simple: what does sophisticated AI use look like, and how do you tell it apart from routine use?
The headline finding: about 90% of employees used AI regularly, but only about 5% used it in ways that generated differentiated value. That works out to roughly seventeen routine users for every sophisticated one, and most organizations can’t see it because they’re measuring the wrong dimension entirely.
The researchers identified four behavioral patterns that consistently predicted sophisticated use. Not prompt length. Not frequency. Behaviors: treating AI as a reasoning partner rather than accepting first outputs, delegating complex multi-step tasks with clear constraints, applying AI across diverse task types instead of just writing assistance, and sustaining longer, working-session-style interactions.
Here’s the part that caught me off guard. The most sophisticated users weren’t the youngest employees — they were above manager level. The conventional wisdom says junior employees are more natural with these tools. The data says otherwise. There’s a real difference between being comfortable with AI and being good at getting results from it. Comfort is about familiarity. Sophistication is about judgment.
The Problem with Averages
When I started building a scoring framework from this research, I ran into an interesting design problem. A weighted average of behavioral dimensions sounds clean, but it lies in predictable ways.
Consider someone who writes long, detailed initial prompts and sustains multi-turn conversations. Their Interaction Depth score is high — maybe an 8 or 9. But they never refine outputs. Never push back. Never ask the model to check its reasoning or explore alternatives. Their Iterative Reasoning score is a 2.
A weighted average might land them at “Proficient.” But they’re not proficient. They’re just verbose. The length of the prompt isn’t the signal. What the user does with the output is.
This is why the framework I built includes gating criteria — floor rules that prevent misclassification. You can’t reach the Advanced tier unless both Task Complexity and Iterative Reasoning hit at least 7 out of 10, regardless of what your weighted average says. Those two dimensions are the strongest differentiators in the research, and they carry 55% of the total score.
The gating mechanism is the single most useful idea in the framework. It forces honest measurement.
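
To make the gate concrete, here is a minimal sketch in Python. The dimension names come from this article; the per-dimension weights, the tier cutoffs, and the fifth dimension are illustrative assumptions on my part. The only constraints taken from the framework are that Task Complexity and Iterative Reasoning together carry 55% of the total and that each must score at least 7 to unlock the Advanced tier.

# A minimal sketch of weighted scoring with gating floors.
# Only the 55% combined weight and the 7/10 floors come from the
# framework; the per-dimension split, the fifth dimension, and the
# tier cutoffs below are illustrative assumptions.
WEIGHTS = {
    "task_complexity": 0.30,      # assumed split of the combined 55%
    "iterative_reasoning": 0.25,  # assumed split of the combined 55%
    "interaction_depth": 0.15,    # assumed
    "task_diversity": 0.15,       # assumed
    "fifth_dimension": 0.15,      # placeholder; the playbook names five
}

# Gating rule: Advanced requires BOTH of these to hit 7/10,
# regardless of what the weighted average says.
GATES = {"task_complexity": 7, "iterative_reasoning": 7}

def score(dims: dict[str, float]) -> tuple[float, str]:
    weighted = sum(WEIGHTS[d] * dims[d] for d in WEIGHTS)
    # Assumed cutoffs: 7+ Advanced, 5+ Proficient, else Developing.
    tier = ("Advanced" if weighted >= 7
            else "Proficient" if weighted >= 5
            else "Developing")
    # The floor rule: miss either gate and Advanced is off the table.
    if tier == "Advanced" and any(dims[d] < g for d, g in GATES.items()):
        tier = "Proficient"
    return weighted, tier

# The verbose user pattern: strong everywhere except refinement.
user = {"task_complexity": 10, "interaction_depth": 10,
        "task_diversity": 10, "fifth_dimension": 10,
        "iterative_reasoning": 2}
w, t = score(user)
print(f"{w:.1f} -> {t}")  # 8.0 -> Proficient, gated out of Advanced

The weighted average alone would call this user Advanced; the floor on Iterative Reasoning is what keeps the label honest.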
What This Means for How You Train
The dimension-level data tells you something activity metrics never can: where to invest in training.
If Iterative Reasoning is consistently low across your organization, another “intro to prompting” workshop won’t help. The gap isn’t in how people write prompts — it’s in how they think about the interaction. They need to learn to treat AI as a reasoning partner: assign roles, provide examples, test assumptions, ask the model to verify its own logic.
If Task Complexity is low, the problem is different. People aren’t delegating hard enough. They’re using AI for tasks they could do themselves in roughly the same time, instead of delegating the genuinely complex, multi-step work where AI creates real operating margin.
The dimension scores give you a specific diagnosis. The diagnosis gives you a specific intervention. That’s the difference between “use AI more” and “here’s what to change about how you use it.”
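
A sketch of that last step, assuming a simple threshold rule (the threshold, the function shape, and the wording of the interventions are mine, paraphrasing the two paragraphs above):

# Map weak dimensions to targeted interventions instead of
# generic "use AI more" advice. The 5.0 threshold is an assumption.
INTERVENTIONS = {
    "iterative_reasoning": ("Teach reasoning-partner habits: assign roles, "
                            "provide examples, test assumptions, ask the "
                            "model to verify its own logic."),
    "task_complexity": ("Coach harder delegation: complex multi-step work "
                        "with clear constraints, not tasks people could do "
                        "themselves in the same time."),
}

def diagnose(org_averages: dict[str, float],
             threshold: float = 5.0) -> list[str]:
    # Return an intervention for every dimension that is weak org-wide.
    return [INTERVENTIONS[d] for d, avg in org_averages.items()
            if d in INTERVENTIONS and avg < threshold]

print(diagnose({"iterative_reasoning": 3.8, "task_complexity": 6.2}))
# -> only the reasoning-partner intervention fires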
The Uncomfortable Implication
If only 5% of users are sophisticated at a firm where 90% are active — a firm that had invested heavily in AI tools and training — then sophisticated use doesn’t happen organically. Making tools available and running training sessions gets you to 90% adoption. It does not get you to sophistication.
Getting there requires measuring the right things, making specific behaviors visible and expected, and building the feedback loops that help people see the gap between where they are and where they could be. Activity metrics can’t do that. Behavioral metrics can.
You can’t operate what you can’t measure. And right now, most organizations are measuring the equivalent of gym swipes.
I built a full playbook with the five scoring dimensions, weighted formula, gating rules, score anchors, and a printable worksheet for manual scoring. I also built a Claude Skill if you want sophistication scoring inside your conversations, along with tips on how to improve.

