Chapter 7

Measuring What Matters

Four metrics that actually reflect flow and context friction: TTCAA, Flow Session Length, Reorientation Time, and Context Provision Ratio.

Architecture and principles are necessary, but they're not sufficient.

If you're introducing AI coding support to a team, you need to answer: Is this actually helping, or just creating noise?

To answer that, you need metrics that reflect flow, not just usage.


Why Traditional Metrics Fail

Lines of code generated: meaningless. More code isn't better code.

Daily active users: tells you who clicked something, not whether it improved their work.

Suggestion acceptance rate: can be high even when the tool destroys flow (for example, a developer accepts a suggestion, then immediately reverts it).

Token throughput: measures AI speed, not human productivity.

None of these measure the context tax.


Four Metrics That Actually Matter

These metrics directly reflect flow and context friction:


1. Time To Commit After AI (TTCAA)

Definition: Time between first AI interaction for a task and the next commit/PR related to that task.

What it shows:

  • When AI interactions produce useful changes quickly, TTCAA is low
  • When AI interactions lead to long ping-pong cycles, TTCAA is high

How to measure:

  • Log each AI invocation with timestamp
  • Track commits with timestamps
  • Calculate: commit_time - first_ai_invocation_time for the same task
  • Average over many tasks
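
The steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the log shapes ({ taskId, timestamp }) and the function name computeTtcaaMinutes are assumptions, and in practice the task key might come from a branch name or ticket ID.

```javascript
// Assumed (hypothetical) log shapes:
//   invocations: [{ taskId, timestamp }]  one entry per AI invocation
//   commits:     [{ taskId, timestamp }]  one entry per commit/PR
// Timestamps are ISO 8601 strings.
function computeTtcaaMinutes(invocations, commits) {
  // Earliest AI invocation per task.
  const firstInvocation = new Map();
  for (const { taskId, timestamp } of invocations) {
    const t = Date.parse(timestamp);
    if (!firstInvocation.has(taskId) || t < firstInvocation.get(taskId)) {
      firstInvocation.set(taskId, t);
    }
  }

  // Earliest commit at or after that first invocation, per task.
  const nextCommit = new Map();
  for (const { taskId, timestamp } of commits) {
    const start = firstInvocation.get(taskId);
    const t = Date.parse(timestamp);
    if (start === undefined || t < start) continue; // no AI use yet for this task
    if (!nextCommit.has(taskId) || t < nextCommit.get(taskId)) {
      nextCommit.set(taskId, t);
    }
  }

  // commit_time - first_ai_invocation_time, in minutes, per task.
  const perTask = {};
  for (const [taskId, commitTime] of nextCommit) {
    perTask[taskId] = (commitTime - firstInvocation.get(taskId)) / 60000;
  }

  const values = Object.values(perTask);
  const average = values.length
    ? values.reduce((sum, v) => sum + v, 0) / values.length
    : null;
  return { perTask, average };
}
```

Tasks with AI invocations but no later commit are left out of the average here; whether to count them as open-ended or drop them is a judgment call for your own data.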

Typical values:

  • Excellent: <10 minutes
  • Good: 10-20 minutes
  • Mediocre: 20-40 minutes
  • Poor: >40 minutes (suggests lots of back-and-forth)

Example:

  • Sarah's TTCAA: 70 minutes (many fruitless AI interactions before finally getting help from Miguel)
  • Miguel's TTCAA: 15 minutes (one AI interaction → quick fix → commit)

2. Flow Session Length

Definition: Average uninterrupted period a developer spends in focused work inside their main tools before a context-breaking action.

What it shows:

  • Longer sessions = deeper focus
  • Frequent breaks = lots of task switching and context reloading

What counts as "breaking":

  • Alt-tabbing to browser/chat
  • Switching to email/Slack
  • Long idle periods (>2 minutes)

What doesn't count:

  • Switching between editor and terminal
  • Running tests
  • Reading other files in the project

How to measure:

  • IDE plugins can track focus time
  • Window management tools can log active window
  • Calculate streaks of continuous focus
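
As a rough sketch of the streak calculation, assume a logger that emits chronological activity samples of the form { timestamp, app }. The app labels ("editor", "terminal", "browser", and so on), the FLOW_APPS set, and the function names are illustrative assumptions, not the output of any particular plugin.

```javascript
// Apps that do NOT break flow (editor/terminal switching is fine).
const FLOW_APPS = new Set(["editor", "terminal"]);
// Idle gaps longer than 2 minutes count as a break.
const IDLE_LIMIT_MS = 2 * 60 * 1000;

// events: chronological samples [{ timestamp: ISO string, app: string }]
// Returns the length of each uninterrupted flow session, in minutes.
function flowSessionsMinutes(events) {
  const sessions = [];
  let start = null; // start of the current streak (ms)
  let last = null;  // last moment known to be in flow (ms)

  const close = () => {
    if (start !== null) sessions.push((last - start) / 60000);
    start = null;
    last = null;
  };

  for (const { timestamp, app } of events) {
    const t = Date.parse(timestamp);
    if (!FLOW_APPS.has(app)) {
      // Context-breaking switch: the streak runs up to the switch itself,
      // unless the developer had already gone idle before it.
      if (start !== null && t - last <= IDLE_LIMIT_MS) last = t;
      close();
      continue;
    }
    if (last !== null && t - last > IDLE_LIMIT_MS) close(); // idle gap ends the streak
    if (start === null) start = t;
    last = t;
  }
  close();
  return sessions;
}

function averageFlowSessionMinutes(events) {
  const sessions = flowSessionsMinutes(events);
  return sessions.length
    ? sessions.reduce((sum, m) => sum + m, 0) / sessions.length
    : null;
}
```

A lone sample followed by a long idle gap produces a zero-length session in this sketch; you may prefer to filter those out before averaging.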

Typical values:

  • Excellent: 25-45 minutes (before natural break)
  • Good: 15-25 minutes
  • Mediocre: 10-15 minutes
  • Poor: <10 minutes (constant interruption)

3. Reorientation Time

Definition: Time from returning to the editor after an AI interaction to making the next meaningful edit.

What it shows:

  • Low reorientation = tool kept you in context
  • High reorientation = you're rebuilding your mental map

What counts as "meaningful edit":

  • Adding/changing code (not just formatting)
  • Running a test
  • Making a commit

What you're measuring:

  • The scroll-and-remember time
  • "Wait, what was I doing?"
  • Re-reading surrounding code to rebuild understanding

How to measure:

  • Log timestamp when AI response completes
  • Log timestamp of next actual edit
  • Calculate: edit_time - ai_response_time
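
A minimal sketch of that calculation, assuming a chronological event log of { timestamp, type } records. The event type names ("ai_response_complete", "edit", "test_run", "commit", "format") are hypothetical labels chosen for this example; when two AI responses arrive back to back, the sketch measures from the most recent one.

```javascript
// Event types that count as a "meaningful edit" per the definition above.
const MEANINGFUL = new Set(["edit", "test_run", "commit"]);

// events: chronological [{ timestamp: ISO string, type: string }]
// Returns one reorientation time (in seconds) per AI response that was
// followed by a meaningful edit.
function reorientationSeconds(events) {
  const durations = [];
  let pendingResponse = null; // time of the latest unanswered AI response (ms)

  for (const { timestamp, type } of events) {
    const t = Date.parse(timestamp);
    if (type === "ai_response_complete") {
      pendingResponse = t;
    } else if (pendingResponse !== null && MEANINGFUL.has(type)) {
      // edit_time - ai_response_time
      durations.push((t - pendingResponse) / 1000);
      pendingResponse = null;
    }
    // Anything else (formatting, scrolling) is ignored: not meaningful.
  }
  return durations;
}
```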

Typical values:

  • Excellent: <30 seconds
  • Good: 30-90 seconds
  • Mediocre: 90-180 seconds
  • Poor: >3 minutes

Sarah vs Miguel:

  • Sarah's average reorientation: ~5 minutes per interaction
  • Miguel's average reorientation: ~20 seconds

This single metric captures the essence of context tax.


4. Context Provision Ratio

Definition: Share of the total context in an AI interaction that the tool gathered automatically, as opposed to context the developer provided manually.

What it shows:

  • High ratio = tool does the work
  • Low ratio = developer is the serialization layer

How to measure:

For each AI interaction, count:

  • Auto context: lines of code/config/logs/errors the tool gathered
  • Manual context: lines the developer copied/pasted or typed in explanation

Context Provision Ratio = Auto Context / (Auto Context + Manual Context)
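
As an illustration, the formula can be applied to a single logged interaction. The field names context_auto and context_manual follow the telemetry example in this chapter; treating every "lines_*" counter as a line count, and the helper names themselves, are assumptions of this sketch.

```javascript
// Sum every "lines_*" counter in a context object. Other counters
// (such as files or config_files) are not line counts and are skipped.
function countLines(context = {}) {
  return Object.entries(context)
    .filter(([key]) => key.startsWith("lines_"))
    .reduce((sum, [, value]) => sum + value, 0);
}

// Auto / (Auto + Manual). Returns null when no context was recorded.
function contextProvisionRatio(event) {
  const auto = countLines(event.context_auto);
  const manual = countLines(event.context_manual);
  const total = auto + manual;
  return total === 0 ? null : auto / total;
}
```

With 165 auto-gathered lines (120 code, 45 logs) and 8 typed lines, the ratio is 165 / 173, or about 0.95.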

Typical values:

  • Excellent: >0.9 (tool gathers 90%+ of context)
  • Good: 0.7-0.9
  • Mediocre: 0.4-0.7
  • Poor: <0.4 (developer doing most of the work)

Example:

  • Sarah's tool: ~0.1 (she manually provided almost everything)
  • Miguel's tool: ~0.95 (tool auto-gathered test, error, logs, config, diff)

Implementing Metrics: Practical Guide

For Engineering Leaders:

Start simple. Pick one metric based on what you can instrument:

Easiest to start: Reorientation Time

  • Requires: IDE plugin or tool with logging
  • Effort: Low
  • Value: High (directly measures context tax)

Medium difficulty: TTCAA

  • Requires: AI tool logging + git commit tracking
  • Effort: Medium
  • Value: High (measures end-to-end effectiveness)

Measure in two phases:

  • 2 weeks of baseline without the AI tool
  • 2 weeks with the AI tool
  • Compare the two periods
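
One simple way to compare the two periods is to look at the change in a typical value of the metric. The sketch below uses the median rather than the mean, which is a choice made for this example because a few very long tasks can distort an average; the function names are hypothetical.

```javascript
function median(values) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b); // numeric sort, input untouched
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// baseline, withAi: arrays of one metric's samples (e.g. minutes per task).
// percentChange is negative when the metric went down.
function comparePeriods(baseline, withAi) {
  const before = median(baseline);
  const after = median(withAi);
  if (before === null || after === null || before === 0) return null;
  return { before, after, percentChange: ((after - before) / before) * 100 };
}
```

For time-based metrics such as TTCAA or reorientation time, a negative percentChange is the improvement you are hoping for; for flow session length, a positive one is.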

Look for:

  • Is TTCAA better than the no-AI baseline (for example, time from starting a task to its first commit)?
  • Is reorientation time reasonable?
  • Do developers feel less fragmented?

For Tool Builders:

Instrument everything from day one:

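// Example telemetry event (one record per AI invocation)
const exampleEvent = {
  event: "ai_invocation",
  timestamp: "2024-01-15T10:23:00Z",  // when the AI was invoked (ISO 8601, UTC)
  developer_id: "hashed_id",          // hashed identifier
  context_auto: {                     // context the tool gathered on its own
    files: 3,
    lines_code: 120,
    lines_logs: 45,
    config_files: 2
  },
  context_manual: {                   // context the developer supplied by hand
    lines_typed: 8  // developer's question
  },
  response_time_ms: 4200,             // time until the response completed
  applied: true,                      // whether the response was applied
  edit_after_response_seconds: 18     // time to next edit (reorientation time)
};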

Track:

  • When AI is invoked
  • What context was gathered (auto vs manual)
  • How long until response
  • Whether response was applied
  • Time to next edit
  • Time to next commit

Dashboard views:

  • TTCAA percentile distribution
  • Reorientation time trend over time
  • Context provision ratio by project
  • Flow session length correlation with AI usage
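
For the percentile view, a small helper is usually enough. This sketch uses the nearest-rank method, which is one of several common percentile definitions; the function names and the choice of p50/p90/p99 are assumptions for illustration.

```javascript
// Nearest-rank percentile: the smallest value such that at least p% of
// the samples are less than or equal to it.
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Summary suitable for a dashboard tile, e.g. TTCAA in minutes.
function summarize(values) {
  return {
    p50: percentile(values, 50),
    p90: percentile(values, 90),
    p99: percentile(values, 99),
  };
}
```

Reporting p90 alongside p50 helps surface the long ping-pong sessions that an average would hide.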

What Good Looks Like

After implementing context-aware AI, you should see:

  • TTCAA decreases by 40-60% compared to chat-first tools
  • Flow sessions lengthen by 30-50%
  • Reorientation time drops below 1 minute on average
  • Context provision ratio >0.85

If you're not seeing these improvements, the tool isn't respecting flow—no matter how good the model is.
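
The targets above can be turned into a simple check. In this sketch the lower end of each range (40% and 30%) is treated as the pass threshold, which is an interpretation rather than something the targets spell out; the object shape and the name checkTargets are hypothetical.

```javascript
// before / after: {
//   ttcaaMinutes, flowSessionMinutes,
//   reorientationSeconds, contextProvisionRatio
// }
// "before" is the earlier setup (e.g. a chat-first tool), "after" is the
// context-aware tool. Returns one boolean per target.
function checkTargets(before, after) {
  const ttcaaReductionPct =
    ((before.ttcaaMinutes - after.ttcaaMinutes) / before.ttcaaMinutes) * 100;
  const flowGainPct =
    ((after.flowSessionMinutes - before.flowSessionMinutes) /
      before.flowSessionMinutes) * 100;

  return {
    ttcaa: ttcaaReductionPct >= 40,                        // at least 40% lower
    flow: flowGainPct >= 30,                               // at least 30% longer
    reorientation: after.reorientationSeconds < 60,        // under 1 minute
    contextProvision: after.contextProvisionRatio > 0.85,  // above 0.85
  };
}
```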