---
title: 'Managing Tokens'
description: 'Every people process has a context window. Most organizations are stuffing it.'
---

Every people process has a context window. Long before LLMs it was full of 15-page performance review templates, 360 feedback from a dozen peers, five years of history dragged into a calibration meeting.

LLMs make this literal. And every piece of data you feed in costs tokens.

More tokens don't mean better output. In fact, they can make it worse[^1].

**What goes in**

This is a complex area because it directly affects performance _and_ cost.

Strip too much context and you get a shallow result. Pack in too much and the model risks losing signal -- finding patterns and drawing conclusions that are more novel than accurate[^2].

It's tempting to jump directly to real data for the inputs. But people processes can make this difficult. For one, the outcomes are messy. You can't A/B test a promotion decision. The feedback loops in people work play out over months or years, which makes it very hard to optimize.

Starting from a decent set of assumptions can short-circuit a lot of optimization.

Two techniques I've found effective: First, go old-school: have a domain expert generate inputs from best practice. They already know the edges of what good (if traditional) input looks like. It's an easy way to a known baseline _and_ the start of your evals.

Second, generate synthetic data and iterate. LLMs are good at producing realistic people datasets, which lets you find the right "size and shape" of inputs. Use this to expand what's possible and extend your evals.
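A minimal sketch of how the two techniques connect, assuming a simple record shape (the field names, case content, and `expert`/`synthetic` labels are illustrative, not a fixed schema): the expert's baseline cases seed the evals, and synthetic input variants reuse the expert's expectations to expand coverage.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    inputs: dict      # what the model will see
    expectation: str  # what a good output must contain
    source: str       # "expert" or "synthetic"

# Expert-written baseline: a known-good input and its expectation.
BASELINE = [
    EvalCase(
        inputs={"review": "Exceeds goals; weak on delegation"},
        expectation="mentions delegation as a growth area",
        source="expert",
    ),
]

def expand(case: EvalCase, variants: list[dict]) -> list[EvalCase]:
    """Pair each synthetic input variant with the expert's expectation,
    so generated data grows the eval set instead of replacing it."""
    return [EvalCase(v, case.expectation, "synthetic") for v in variants]
```

The point of the shape: synthetic data varies the _inputs_, while the definition of "good" stays anchored to the expert baseline.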

**Tune models and settings**

A rubric check is a structured, deterministic task -- a cheaper model at low temperature handles it fine. Drafting development feedback that needs nuance and empathy may require a different model and more care with settings.

We break people processes into sub-agents, each with their own model and temperature. A compliance check runs on Haiku at temperature 0. A career development summary runs on Opus with more room to reason. The compliance agent doesn't need to understand someone's growth trajectory, and the development agent doesn't need to validate field formats. Each gets only the context it needs and returns a condensed result[^3]. This cuts cost on the mechanical steps and improves quality on the ones that matter most.
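As a sketch of the routing, assuming a simple config table (the model names, temperatures, and task labels below are illustrative, not our production values): each sub-task maps to its own model and settings, and unknown tasks fail loudly rather than falling back to an expensive default.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentConfig:
    model: str
    temperature: float
    max_tokens: int

SUB_AGENTS = {
    # Deterministic, mechanical check: cheap model, temperature 0.
    "compliance_check": AgentConfig("claude-haiku", temperature=0.0, max_tokens=512),
    # Nuanced synthesis: larger model, more room to reason.
    "career_summary": AgentConfig("claude-opus", temperature=0.7, max_tokens=2048),
}

def config_for(task: str) -> AgentConfig:
    """Look up the settings for a sub-task; fail loudly on unknown tasks."""
    try:
        return SUB_AGENTS[task]
    except KeyError:
        raise ValueError(f"No sub-agent configured for task: {task!r}")
```

Keeping the mapping in one table makes the cost/quality trade-off explicit and easy to audit when a new sub-task gets added.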

**What comes out**

[Structure steers](/structure-steers/) -- constrain output to as much structure as practical for your case.
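One minimal way to enforce that constraint, assuming a JSON reply and a hypothetical rubric-check schema (the field names are examples): parse the model's output into a fixed shape and reject anything that doesn't match, rather than trusting free text.

```python
import json
from dataclasses import dataclass, fields

@dataclass
class RubricResult:
    criterion: str
    passed: bool
    evidence: str

def parse_result(raw: str) -> RubricResult:
    """Parse a JSON reply into the schema, rejecting extra or missing keys."""
    data = json.loads(raw)
    expected = {f.name for f in fields(RubricResult)}
    if set(data) != expected:
        raise ValueError(f"Schema mismatch: got {sorted(data)}")
    return RubricResult(**data)
```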

**When data expires**

An LLM can reconcile inconsistent job titles or navigate the ambiguity of a reorg. It can't know that a glowing review was written before a PIP, or that a team's OKRs changed last month. Messiness is something LLMs handle well. Staleness actively misleads.

Tag when data was collected. Knowing someone has a certain manager is useful. Knowing they've had that manager for 2 days is a different and perhaps more powerful signal. Build expiry into your pipeline.
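A sketch of what that pipeline step could look like, assuming facts arrive as `(value, collected_at)` pairs with a per-field TTL (the field names and TTL values are made up for illustration): stale facts get dropped before they reach the prompt, and fresh ones carry their age so the model can weigh recency.

```python
from datetime import datetime, timedelta, timezone

# Illustrative per-field time-to-live; tune per data source.
TTL = {
    "manager": timedelta(days=180),
    "okr": timedelta(days=90),
}

def fresh_facts(facts: dict, now: datetime) -> dict:
    """Keep only facts within their TTL, and attach age_days so the model
    can distinguish a manager of 2 days from a manager of 2 years."""
    kept = {}
    for field, (value, collected_at) in facts.items():
        age = now - collected_at
        if age <= TTL.get(field, timedelta(days=365)):
            kept[field] = {"value": value, "age_days": age.days}
    return kept
```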

**PII is a token problem too**

Every piece of personally identifiable information you include is a token that can work against you. [LLMs can infer sensitive attributes](https://hai.stanford.edu/news/privacy-ai-era-how-do-we-protect-our-personal-information) -- demographic details, health status, performance trajectory -- from data points that aren't themselves PII. Don't include it unless it's essential to the task.
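The simplest mechanical enforcement of "don't include it unless it's essential" is an allowlist per task, sketched below with hypothetical field names and task labels: anything not explicitly needed is dropped before it costs tokens or leaks a signal the task doesn't need.

```python
# Illustrative per-task allowlists; the real set depends on your process.
ESSENTIAL_FIELDS = {
    "compliance_check": {"employee_id", "role", "policy_ack_date"},
    "career_summary": {"role", "tenure_months", "recent_feedback"},
}

def scoped_record(record: dict, task: str) -> dict:
    """Keep only the fields on the task's allowlist; default to nothing."""
    allowed = ESSENTIAL_FIELDS.get(task, set())
    return {k: v for k, v in record.items() if k in allowed}
```

Defaulting to an empty set for unknown tasks means a new sub-agent gets no data until someone decides what it actually needs.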

If you're using masking tools, know that [they introduce their own biases](https://www.liveperson.com/blog/exploring-pii-masking-bias/): names associated with certain demographic groups get masked less reliably than others.

### FOOTNOTES

[^1]: [Chroma found](https://research.trychroma.com/context-rot) that every frontier model they tested degrades as input length increases, even well within claimed limits. Separately, [researchers studying real-world context limits](https://arxiv.org/abs/2509.21361) found models fell short of their advertised context window by as much as 99% on applied tasks.

[^2]: [Liu et al. showed](https://aclanthology.org/2024.tacl-1.9/) that information in the middle of a long context gets significantly less attention than information at the beginning or end. The goal isn't minimal input, it's the _right_ input.

[^3]: Anthropic's [context engineering guide](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents) recommends sub-agent architectures for exactly this reason: focused agents that each get only the context they need, returning condensed summaries rather than passing the full history forward.
