You can't A/B test a performance review. A promotion calibration plays out over months, and a hiring debrief's accuracy only becomes clear after the hire has been in the role for a while. Even when you can prove a change helped, the cycle has moved on. Privacy and bias aside, this is the core constraint of applying AI to people processes: the feedback loop is too slow to iterate your way to good inputs.
This makes the design and shape of your inputs more important than ever. Every people process has a context window: 15-page review templates, 360 feedback from a dozen peers, five years of history dragged into a calibration meeting. Before, this was in people's heads. LLMs make it literal.
With an LLM, every piece of data costs tokens, and more tokens doesn't mean better output[1]. The data going in is patchy, self-reported, subjective, and inconsistent from person to person[2]. Pack in too much and the model buries signal in the middle[3]. Strip out too much and you get a shallow result. And because of that slow feedback loop, you can't easily tell which you've done.
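One way to make the pack-too-much / strip-too-much tradeoff explicit is to budget tokens per input source instead of concatenating everything. A minimal sketch, assuming a rough 4-characters-per-token heuristic in place of a real tokenizer (the function names and the equal-share policy are illustrative, not from any particular system):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    # A real pipeline would use the model's actual tokenizer.
    return max(1, len(text) // 4)

def pack_inputs(sources: dict[str, str], budget: int) -> dict[str, str]:
    """Spread a fixed token budget evenly across input sources
    (review template, peer feedback, history, ...), truncating
    each rather than letting one source crowd out the others."""
    per_source = budget // max(1, len(sources))
    limit = per_source * 4  # convert the token share back to chars
    return {name: text[:limit] for name, text in sources.items()}
```

An even split is the crudest possible policy; the point is only that the budget becomes a named, inspectable decision rather than whatever happens to fit.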
In most AI applications, you solve this with existing data and iteration. Run it, measure, adjust, repeat. But that's much less effective here. So where possible I work from informed baselines instead. The usual prompt toolkit still applies[4]. The difference here is what you layer on top.
First, have a domain expert draft the inputs. Not "consult a domain expert" -- literally have them write the performance summary, the hiring debrief, the calibration brief from their own best practice. They already know where managers under-report, what a useful debrief contains, how much history is too much. This gives you a known-good baseline and the start of your evals. The hazard is that you bake in the limitations of the legacy process. But when you can't iterate quickly, a functional prior beats a blank page.
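Those expert-drafted documents can double as eval fixtures: each pairs realistic inputs with a known-good output to compare the model against. A minimal sketch, where the field names and the word-overlap metric are illustrative assumptions (a real eval would use rubrics, an LLM judge, or human review):

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    inputs: str          # the inputs as the expert would draft them
    expert_output: str   # the expert's own known-good output

def overlap_score(candidate: str, reference: str) -> float:
    """Crude proxy metric: fraction of the expert's terms that
    appear in the model's output. Deliberately simple -- the value
    is in having a fixed baseline to score against at all."""
    ref_terms = set(reference.lower().split())
    cand_terms = set(candidate.lower().split())
    return len(ref_terms & cand_terms) / max(1, len(ref_terms))
```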
Second, generate synthetic data. LLMs are good at producing realistic edge cases and exploring the solution space -- a fake employee with three years of patchy reviews, a manager who writes two-sentence assessments, a high performer whose metrics don't match their peer feedback. Testing against these stands in for the feedback loop you don't otherwise have.
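The three failure modes above can be pinned down as concrete test records. A minimal sketch, assuming hypothetical field names and thresholds (in practice an LLM would generate many variations of each case):

```python
from dataclasses import dataclass

@dataclass
class SyntheticEmployee:
    name: str
    reviews: list[str]        # one entry per cycle; may be sparse
    manager_assessment: str
    metrics_rating: int       # e.g. 1-5 from quantitative metrics
    peer_rating: int          # e.g. 1-5 aggregated from peer feedback

# One hand-written record per failure mode from the text.
EDGE_CASES = [
    SyntheticEmployee("patchy_history", ["solid year", "", "", "ok"],
                      "meets expectations", 3, 3),
    SyntheticEmployee("terse_manager", ["good year"],
                      "Fine. No concerns.", 4, 4),
    SyntheticEmployee("metrics_mismatch", ["ships a lot"],
                      "top performer", 5, 2),
]

def is_suspicious(e: SyntheticEmployee) -> bool:
    """Flag records a review pipeline should handle gracefully."""
    patchy = sum(1 for r in e.reviews if not r.strip()) >= 2
    terse = len(e.manager_assessment.split()) <= 4
    mismatch = abs(e.metrics_rating - e.peer_rating) >= 2
    return patchy or terse or mismatch
```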
The final approach is a well-known principle: keep a human in the loop. Many systems are moving to a hybrid model where human and AI assessments run side by side. A performance review, for example, might include both, and only be flagged for discussion when the delta between the AI's assessment and the manager's is meaningful. Most of the time that delta highlights information the AI is missing and needs in order to be reliable. As I mention in People Systems are the Next Codebase, the AI isn't doing the review; it makes the review worth having.
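The delta check itself is tiny -- the substance is in choosing the scale and threshold. A minimal sketch, assuming both assessments are reduced to a score on the same scale (the function name and the default threshold are illustrative):

```python
def flag_for_discussion(ai_score: float, manager_score: float,
                        threshold: float = 1.0) -> bool:
    """Surface a review for human discussion only when the AI's
    assessment and the manager's differ meaningfully. Both scores
    are on the same scale (e.g. 1-5); the threshold is a policy
    choice made by people, not a model output."""
    return abs(ai_score - manager_score) >= threshold
```

Everything under the threshold passes through quietly; the human's time is spent only where the two assessments disagree.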
You typically don't get fast feedback in people systems, so you don't get to discover good inputs by iteration. You have to start with the right size and shape -- and then try to break them before reality does.