Structured output doesn't just constrain LLMs; it steers them. I've historically treated output purely as an interface concern -- shape it for downstream systems and move on. That was a mistake.
In my work with People Systems, you're often working with messy, incomplete data -- fragments for one employee, rafts of unstructured documents for another. In this environment, prompting and input presentation matter immensely. But I've found the output schema is also an important lever.
A hiring debrief agent that returns {"recommendation": "advance", "concerns": ["limited backend experience"], "confidence": 0.7} is more auditable than three paragraphs of equivocation. It also guides the model to commit rather than hedge in natural language[1].
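A schema like that can be made to force commitment. As a minimal sketch (the enum values and `additionalProperties` constraint are my assumptions, not the actual production schema): an enum on `recommendation` leaves the model no room to equivocate in prose, and a bounded `confidence` makes the hedging explicit and machine-readable.

```python
# Hypothetical JSON Schema for the debrief output above. The enum forces a
# discrete commitment; bounds on confidence keep hedging quantified.
DEBRIEF_SCHEMA = {
    "type": "object",
    "properties": {
        "recommendation": {"type": "string", "enum": ["advance", "hold", "reject"]},
        "concerns": {"type": "array", "items": {"type": "string"}},
        "confidence": {"type": "number", "minimum": 0.0, "maximum": 1.0},
    },
    "required": ["recommendation", "concerns", "confidence"],
    "additionalProperties": False,  # no side channel for free-text equivocation
}

print(DEBRIEF_SCHEMA["properties"]["recommendation"]["enum"])
```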
We had one process where the output needed to avoid a direct rating: it ran prior to a formal rating cycle, and any "jumping the gun" worked against what the process was trying to achieve. We built guardrails into the prompts, but something "rating-like" kept emerging in the output[2].
The open-loop, heterogeneous nature of the data was pulling hard against us[3]. If you don’t give a behavior a place to go, it leaks into the rest of the output.
The fix: I moved the rating to its own structured field, then discarded it. The model kept trying to rate. Fighting that was expensive and fragile. Giving it a place to put the rating, then throwing it away, was cheaper, more reliable, and kept the rest of the output clean[4].
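The mechanics of the fix are simple. A minimal sketch, with hypothetical field names (the real schema and response are assumptions): the schema includes a sink field for the rating, and the parsing layer drops it before anything downstream sees it.

```python
import json

# Hypothetical model response, constrained to a schema that includes a
# "rating" sink field we never actually use.
raw = '{"summary": "Strong collaboration signals.", "themes": ["communication"], "rating": 4}'

result = json.loads(raw)
# Give the behavior a place to go, then throw it away: the rating never
# reaches downstream consumers, and it stops leaking into other fields.
result.pop("rating", None)
print(result)
```

The design choice here is to stop fighting the model's tendency and route it instead: validation stays simple, and the remaining fields stay clean.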
This cuts both ways, and your evals need to do some heavy lifting here. Research has shown that naive format constraints can degrade reasoning -- the model is forced into answering before it's done thinking[5]. Well-designed schemas scaffold reasoning; poorly designed ones short-circuit it. The difference is whether the structure gives the model room to think before it commits.
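One concrete way to give the model that room, sketched under the assumption that the structured-output decoder emits fields in the order the schema lists them (field names here are illustrative): put a free-text reasoning field before the committed answer, so the thinking happens in-band and upstream of the commitment.

```python
# Generation is autoregressive, so field order is the lever: "reasoning"
# comes first (room to think), "recommendation" comes last (the commitment).
# A schema with the order reversed would force the answer before the thinking.
SCAFFOLDED_SCHEMA = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},       # generated first
        "recommendation": {"type": "string"},  # generated last
    },
    "required": ["reasoning", "recommendation"],
}

print(list(SCAFFOLDED_SCHEMA["properties"]))
```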