
Workshop 5: AI Evaluation Handout

We work through this together, live, in the session. Follow along in your own clone, using the AI tool of your choice (Claude / Copilot / ChatGPT / Cursor / other).

Record your results in your logbook as we go.


Part 0: Your comprehension-preserving system prompt

AI is fastest at taking the thinking off your plate, which is exactly the risk. Cognitive offloading is when you hand the thinking to the tool and stop building the mental model yourself. With AI it is fast and invisible: the code works, you accept it, and you never form the understanding that writing it yourself would have forced. The tell is simple: if you could not re-derive or explain the output without the AI in front of you, you offloaded comprehension, not just typing.

We fight that today by setting a system prompt that makes the AI keep you in the loop, and by proving comprehension after each task. We do this live, together.

Open starter/active-comprehension-prompt.md, adapt it to your style, and set it as the system prompt (or paste it at the top of the chat) before we start. Use it for every task below.

Note the one change you made to the sample prompt and why. Record this in your logbook as we go.


Part 1: Refactor task

Target file: apps/api/src/routes/reports.ts (the ~130-line god function)
Prompt template used: starter/refactor-prompt.md

We run this together. As we go, capture: how many iterations on the prompt it took before the output was usable, whether the existing tests passed after applying the refactor (and what broke if not), one thing the AI got right that surprised you, and one thing the AI got wrong that you had to fix manually. Record this in your logbook as we go.

Prove comprehension (pick one and do it, live): name every helper the AI extracted and what it produces, without re-reading the code, OR change one helper (rename it and adjust a behavior) by hand, no AI, and keep the tests green.

Then, in your own words, explain why the refactor preserves the original behavior. Record this in your logbook as we go.
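One way to back up a "behavior is preserved" claim is a characterization check: run the original and the refactored code on the same inputs and compare the outputs. The sketch below is illustrative only. The `Row` shape and every function name are hypothetical and much smaller than the real handler in reports.ts.

```typescript
// Hypothetical data shape; the real reports.ts handler is far larger.
interface Row {
  region: string;
  amount: number;
}

// "Before": one function that filters, totals, and formats in a single pass.
function buildReportOriginal(rows: Row[]): string[] {
  const totals: Record<string, number> = {};
  for (const row of rows) {
    if (row.amount <= 0) continue;
    totals[row.region] = (totals[row.region] ?? 0) + row.amount;
  }
  return Object.keys(totals).sort().map((region) => `${region}: ${totals[region]}`);
}

// "After": the same three steps as named helpers.
function keepPositive(rows: Row[]): Row[] {
  return rows.filter((row) => row.amount > 0);
}

function totalByRegion(rows: Row[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const row of rows) {
    totals[row.region] = (totals[row.region] ?? 0) + row.amount;
  }
  return totals;
}

function formatLines(totals: Record<string, number>): string[] {
  return Object.keys(totals).sort().map((region) => `${region}: ${totals[region]}`);
}

function buildReportRefactored(rows: Row[]): string[] {
  return formatLines(totalByRegion(keepPositive(rows)));
}

// Characterization check: same inputs, outputs must be identical.
const cases: Row[][] = [
  [],
  [
    { region: "EU", amount: 10 },
    { region: "US", amount: 5 },
    { region: "EU", amount: 2 },
  ],
  [
    { region: "US", amount: -3 },
    { region: "US", amount: 0 },
  ],
];

const mismatches = cases.filter(
  (rows) =>
    JSON.stringify(buildReportOriginal(rows)) !==
    JSON.stringify(buildReportRefactored(rows)),
);
```

An empty `mismatches` array only shows the two versions agree on these inputs. The written explanation still has to argue why each helper covers exactly one step of the original, including the edge cases (empty input, non-positive amounts).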


Part 2: Test generation task

Target file: apps/api/src/services/auth.service.ts (zero tests originally)
Prompt template used: starter/test-gen-prompt.md

We run this together. As we go, capture: the number of tests the AI generated, how many passed on the first run, and how many you had to fix or delete. Record this in your logbook as we go.

Also note the most common failure mode in the AI-generated tests:

  • Wrong import paths
  • Imagined APIs that don't exist
  • Missed edge cases (empty, null, error paths)
  • Incorrect type assertions that "compile but lie"
  • Other

Record which one you saw in your logbook as we go.
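As a reference point for the "compile but lie" item, here is a minimal sketch of what that failure mode can look like. `AuthUser`, `parseUserUnsafe`, `isAuthUser`, and `parseUser` are hypothetical names for illustration, not taken from auth.service.ts.

```typescript
// Hypothetical type for illustration; not the real auth.service.ts shape.
interface AuthUser {
  id: string;
  email: string;
}

// The cast satisfies the compiler but checks nothing at runtime.
function parseUserUnsafe(json: string): AuthUser {
  return JSON.parse(json) as AuthUser;
}

// A runtime guard makes the type claim true instead of merely asserted.
function isAuthUser(value: unknown): value is AuthUser {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record["id"] === "string" && typeof record["email"] === "string";
}

function parseUser(json: string): AuthUser {
  const parsed: unknown = JSON.parse(json);
  if (!isAuthUser(parsed)) throw new Error("payload is not an AuthUser");
  return parsed;
}

// Typed as AuthUser, yet the email field is absent at runtime.
const lying = parseUserUnsafe('{"id": "u1"}');
const emailIsMissing = typeof lying.email === "undefined";
```

A generated test that builds its fixtures with `as` casts can pass while exercising an object the real code would never receive. The guarded version fails loudly on the same payload.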

Prove comprehension (live): pick one generated test, change its assertion to something you know is wrong, and predict the exact failure before running it. Did the actual failure match your prediction? Record the prediction, the result, and any gap in your logbook as we go.
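The predict-then-run step can be sketched as follows. Everything here is a stand-in: `normalizeEmail` is a hypothetical function under test, and `expectEqual` is a tiny substitute for whatever assertion your test runner provides, so the real failure message will be formatted differently.

```typescript
// Hypothetical function under test; stands in for something in auth.service.ts.
function normalizeEmail(raw: string): string {
  return raw.trim().toLowerCase();
}

// Tiny stand-in for a test runner's equality assertion.
function expectEqual<T>(actual: T, expected: T): void {
  if (actual !== expected) {
    throw new Error(
      `expected ${JSON.stringify(expected)} but received ${JSON.stringify(actual)}`,
    );
  }
}

// Step 1: write the prediction down BEFORE running the broken assertion.
const predicted = 'expected "ALICE@EXAMPLE.COM" but received "alice@example.com"';

// Step 2: deliberately break the assertion and capture the real failure.
let actualFailure = "test passed (unexpected)";
try {
  expectEqual(normalizeEmail("  Alice@Example.com "), "ALICE@EXAMPLE.COM");
} catch (err) {
  actualFailure = err instanceof Error ? err.message : String(err);
}

// Step 3: compare. Any mismatch is the gap to record in the logbook.
const predictionMatched = actualFailure === predicted;
```

Predicting the received value, not just "it will fail", is the point: it forces you to trace what the code under test actually returns.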


Part 2b: Mock the SDK clients (the big test-speed win)

The test:api suite takes ~12 minutes because the integration tests (billing.rollup, email.campaigns, webhooks.fanout, search.reindex, notifications.blast, plus reports.test.ts) make real HTTP calls through the SDK clients. The services already take an injectable client (e.g. BillingService takes client: BillingClient = new BillingClient()), so the AI can swap in mocks for all six clients (billing / email / search / webhooks / notifications / audit). This is the payoff Workshop 3 set up: with the clients mocked, the ~12-minute test:api run drops to seconds.
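A minimal sketch of the injection pattern described above. Only `BillingService`, `BillingClient`, and the `client: BillingClient = new BillingClient()` default come from this handout. The `fetchInvoices` method, the `Invoice` shape, `monthlyTotal`, and `FakeBillingClient` are hypothetical stand-ins for whatever the real SDK and service expose.

```typescript
// Hypothetical shape: the real SDK's types and method names may differ.
interface Invoice {
  customerId: string;
  amountCents: number;
}

class BillingClient {
  // In the real SDK this would be an HTTP request; that is the slow part.
  async fetchInvoices(customerId: string): Promise<Invoice[]> {
    throw new Error(`real HTTP call for ${customerId} is not available in this sketch`);
  }
}

class BillingService {
  // The default argument keeps production call sites unchanged; tests pass a fake.
  constructor(private readonly client: BillingClient = new BillingClient()) {}

  async monthlyTotal(customerId: string): Promise<number> {
    const invoices = await this.client.fetchInvoices(customerId);
    return invoices.reduce((sum, invoice) => sum + invoice.amountCents, 0);
  }
}

// Hand-rolled fake: same interface, canned data, and a record of each call.
class FakeBillingClient extends BillingClient {
  calls: string[] = [];

  constructor(private readonly canned: Invoice[]) {
    super();
  }

  async fetchInvoices(customerId: string): Promise<Invoice[]> {
    this.calls.push(customerId);
    return this.canned.filter((invoice) => invoice.customerId === customerId);
  }
}

// Usage: the service logic runs for real; only the network boundary is replaced.
const fake = new FakeBillingClient([
  { customerId: "c1", amountCents: 1200 },
  { customerId: "c1", amountCents: 800 },
  { customerId: "c2", amountCents: 500 },
]);
const service = new BillingService(fake);
const totalPromise = service.monthlyTotal("c1"); // resolves to 2000 without any HTTP
```

The service's own summing logic is still exercised end to end, so the test checks the same behavior as before. What it no longer checks is the remote system, which an API-suite unit test arguably should not depend on. A mocking library from your test runner could replace the hand-rolled fake; the fake is used here only to keep the sketch dependency-free.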

We do this together. Capture the test:api runtime before and after mocking the clients, and which clients you mocked. Record this in your logbook as we go.

Prove comprehension (live): in one sentence, explain why injecting a mocked client makes the suite fast without weakening what the test checks. Record it in your logbook as we go.


Part 3: Evaluation rubric

Together we score each piece of AI output on the four-axis rubric (1 = bad, 5 = excellent). See starter/evaluation-rubric.md for what each level means.

For both the refactor output and the test-generation output, score each axis (Correctness, Maintainability, Coverage, Idiom) with a one-sentence justification. Record your scores and justifications in your logbook as we go.


Part 4: Reflection

We discuss these together. Capture your answers in your logbook as we go.

  • Where AI saved you real time.
  • Where AI produced something you had to throw away entirely.
  • Where blind acceptance would have bitten you: name one moment where the output looked right, you almost accepted it, and understanding it caught a bug or a gap.

Then run the active-comprehension loop on one thing you kept today, and record it in your logbook:

  • Why does it work?
  • What breaks if an assumption changes (empty input, a client times out, the schema shifts)?
  • What tradeoff did you accept by doing it this way instead of the alternative?

Finally, note a rule you'd add to the team's AI policy based on what you saw today. Record it in your logbook as we go.


Keep your logbook

We reuse your logbook in Workshop 6 and Workshop 8, and the policy notes feed directly into Session 7. Your applied refactor and AI-generated tests live in the repo; everything else (the grades, the near-misses, your comprehension-preserving system prompt) lives in your logbook's W5 section.