AI output evaluation rubric: Workshop 5

For every piece of AI-generated code you apply, grade it on each of the four axes below. Use the descriptions to pick a level honestly.

Correctness

Does the code do what it claims to do?

  • 5 (Excellent): All existing tests pass. Behavior matches the spec exactly. No regressions in any code path.
  • 4 (Good): Tests pass; a minor edge case was missed, but it is unlikely to be hit in production.
  • 3 (Acceptable): Works for the happy path; needs one or two manual fixes for clear bugs.
  • 2 (Poor): Compiles but fails multiple tests; requires substantial rework.
  • 1 (Bad): Confidently wrong in ways that would ship without a careful review.

Maintainability

Would a teammate be able to read and modify this six months from now?

  • 5 (Excellent): Reads like a senior engineer wrote it. Names are clear, structure is obvious.
  • 4 (Good): Mostly clear, one or two questionable naming choices.
  • 3 (Acceptable): Readable but generic; would need a refactor before being a long-term part of the codebase.
  • 2 (Poor): Confusing structure, names that don't match what the code does. Would require explanation.
  • 1 (Bad): Spaghetti. Comments contradict the code. Future-you would rewrite this.

Coverage

Does it handle edge cases or only the happy path?

  • 5 (Excellent): Empty, null, error, and boundary cases are all considered.
  • 4 (Good): Most edge cases handled; one minor gap.
  • 3 (Acceptable): Happy path solid, common errors handled, but edges incomplete.
  • 2 (Poor): Only the happy path. Would break on common real-world inputs.
  • 1 (Bad): Doesn't even handle the happy path completely.
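
To make the four case types from level 5 concrete, here is a minimal sketch. The function `average_duration` and its return values for null and empty input are hypothetical, chosen only for illustration; what each case should return is a decision for your own spec.

```python
def average_duration(durations):
    """Hypothetical helper: mean task duration in minutes."""
    # Null case: no list was provided at all (assumed to mean "no data").
    if durations is None:
        return 0.0
    # Empty case: without this check, the division below raises ZeroDivisionError.
    if len(durations) == 0:
        return 0.0
    # Error case: reject input that cannot be a valid duration.
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    # Happy path. A single-element list is the boundary case and works as-is.
    return sum(durations) / len(durations)
```

Code that contains only the final `return` line would sit at level 2: it handles the happy path and crashes on an empty list.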

Idiom

Does it match the conventions of the surrounding code?

  • 5 (Excellent): Indistinguishable from code written by the rest of the team.
  • 4 (Good): Mostly matches; one stylistic deviation (different error pattern, etc.).
  • 3 (Acceptable): Functional but stylistically obvious as AI output.
  • 2 (Poor): Imports unfamiliar libraries, uses patterns the team doesn't use elsewhere.
  • 1 (Bad): A different language ecosystem entirely (Python idioms in TypeScript, etc.).

How to use this in code review

When grading, be specific. The score itself is less valuable than the one-line justification. "Coverage: 2, didn't handle the case where the tasks array is empty, which would have crashed the response builder" is what you want to write. That's review-grade feedback.
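
If you keep your logbook in code rather than on paper, one way to enforce "score plus justification" is a small record type. This is a sketch under assumptions: the `Grade` class, its field names, and the `as_logbook_line` format are hypothetical and not part of the workshop materials.

```python
from dataclasses import dataclass

# The four axes defined in this rubric.
AXES = ("Correctness", "Maintainability", "Coverage", "Idiom")


@dataclass(frozen=True)
class Grade:
    """Hypothetical logbook entry: one axis, one score, one justification."""
    axis: str
    score: int
    justification: str

    def __post_init__(self):
        if self.axis not in AXES:
            raise ValueError(f"unknown axis: {self.axis!r}")
        if not 1 <= self.score <= 5:
            raise ValueError("score must be between 1 and 5")
        # A bare number is not review-grade feedback, so reject it.
        if not self.justification.strip():
            raise ValueError("a one-line justification is required")

    def as_logbook_line(self):
        return f"{self.axis}: {self.score}, {self.justification}"


entry = Grade("Coverage", 2, "didn't handle the case where the tasks array is empty")
print(entry.as_logbook_line())
```

Rejecting an empty justification at construction time means a score can never be recorded without the reasoning behind it.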

How this rubric feeds Session 7

In Session 7 you write an AI usage policy for the team. The patterns of failures you grade today directly suggest policy rules. For example:

  • If you score Coverage low repeatedly → policy: "AI-generated test files must cover at least one error case per function."
  • If you score Idiom low repeatedly → policy: "AI output must be linted and formatted to match team conventions before commit."
  • If you score Correctness low repeatedly → policy: "AI-generated logic in security-related modules requires a human-written test suite, not an AI-generated one."
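
The mapping from repeated low scores to policy rules can be sketched as a simple tally. The function `suggest_policies` and its thresholds are assumptions: the rubric does not define what counts as "low" or "repeatedly", so a score of 2 or below, seen at least twice, is used here as a placeholder.

```python
from collections import Counter

# Example rules quoted from the list above. Maintainability has no example
# rule in this handout, so it is left out of the mapping.
POLICY_HINTS = {
    "Coverage": "AI-generated test files must cover at least one error case per function.",
    "Idiom": "AI output must be linted and formatted to match team conventions before commit.",
    "Correctness": "AI-generated logic in security-related modules requires a human-written test suite, not an AI-generated one.",
}


def suggest_policies(grades, low_threshold=2, min_repeats=2):
    """Return the example rules for axes that were scored low repeatedly.

    grades: iterable of (axis, score) pairs from the logbook.
    low_threshold and min_repeats are assumed values, not part of the rubric.
    """
    low_counts = Counter(axis for axis, score in grades if score <= low_threshold)
    return [rule for axis, rule in POLICY_HINTS.items() if low_counts[axis] >= min_repeats]


logbook = [("Coverage", 2), ("Coverage", 1), ("Idiom", 2), ("Correctness", 5)]
for rule in suggest_policies(logbook):
    print(rule)
```

With this sample logbook only Coverage was scored low twice, so only the Coverage rule is printed.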

Record your grades in your logbook as we go. They're not throwaway.