Human Compute
Why AI Companies Need Evals for the People in the Loop
There is a strange asymmetry emerging inside AI-native companies.
On one side, we are becoming extremely precise about the machines. We measure inference speed, benchmark performance, memory, tool use, coding ability, retrieval accuracy, alignment behaviour, latency, cost per token, context length, and failure modes under increasingly specific conditions. If a model is going to be trusted inside a serious workflow, we want to know where it is strong, where it is brittle, how it behaves under pressure, and what kind of supervision it still requires.
On the other side, the human beings supervising those systems are treated as if their capacity is obvious, private, or effectively infinite.
This asymmetry is going to matter more as agentic systems become normal inside companies. A human in an AI-native organisation is no longer just doing the work. They are reviewing, interpreting, correcting, prioritising, escalating, and integrating the work of systems that do not sleep, do not fatigue, and do not experience overload in the human sense. The machine side of the organisation is becoming more measurable at exactly the moment the human side is becoming more loaded.
The gap is not that companies measure nothing about people. They measure a lot, often badly. Engagement surveys, performance reviews, retention dashboards, pulse checks, burnout questionnaires, manager ratings, productivity metrics — the modern company is not short of human data. What it is short of is a useful theory of human capacity under acceleration.
Most of the existing instruments sit downstream of the real variable. They tell you, usually too late, whether people are unhappy, exhausted, disengaged, or thinking about leaving. They do not tell you whether the human system still has enough available capacity to do the work being asked of it.
The phrase I have started using is human compute. I do not love it, because humans are not machines, and the point of this essay is not to make them more machine-like. But the phrase does useful work. It names a missing category: the cognitive, emotional, physiological, relational, and creative capacity of the people who are supposed to supervise, challenge, integrate, and improve the output of increasingly capable AI systems.
As machine compute becomes more abundant, human compute becomes more consequential.
I want to be clear up front that I am not arguing for surveillance, quantified management, or another dashboard that quietly converts human beings into red, amber, and green indicators. That would be the most predictable and least intelligent version of this idea. I am arguing something narrower and, I think, more useful: that AI-native companies will need a way to understand the state of the human substrate with the same seriousness they bring to the state of their technical systems. Not because people are machines, but because they are not.
The people in the loop have nervous systems. They have finite attention. They need recovery. Their judgement changes under threat. Their creativity changes under pressure. Their ability to disagree, surface bad news, and repair conflict is state-dependent. If the company depends on human oversight, then the condition of the humans doing the oversight is part of the performance architecture of the company.
The missing eval
The model-evaluation world has a useful instinct: do not assume capability; test it. Do not assume robustness; stress it. Do not assume a system will behave well under deployment conditions just because it performs well in a clean environment. Build evals that reveal the difference between apparent performance and real-world reliability.
That instinct has not yet crossed the membrane into organisational design.
Most companies operate as if human capacity is either fixed, private, or self-evident. If someone is senior, capable, and well-paid, the organisation assumes they can keep integrating whatever the system throws at them until they cannot. If a team is moving fast, the organisation assumes it is functioning well until attrition, conflict, or degraded execution proves otherwise. If an executive appears calm, the organisation assumes they are clear. If an engineer is responsive at midnight, the organisation reads commitment rather than a boundary that may already have collapsed.
The result is that the organisation often discovers the limits of human compute only after it has already overdrawn the account.
This matters more in the age of agentic AI because the load is changing shape. The old bottleneck was often production: not enough people to write the code, draft the document, analyse the data, produce the options, answer the customer, test the workflow. The new bottleneck is increasingly integration. Agents can generate more than the organisation can intelligently absorb. They produce drafts, summaries, plans, analyses, recommendations, tickets, code, messages, and suggested next actions. Each of these may reduce one kind of work while increasing another: review, judgement, prioritisation, correction, escalation, and context management.
A company can therefore become more productive and less intelligent at the same time.
This is the failure mode a human-compute eval should catch. Not whether people are happy in a general sense. Not whether they like their manager. Not whether they endorse the company values on a survey. Those things matter, but they are not the core question. The core question is whether the human system still has enough available capacity to think clearly, decide well, coordinate honestly, recover adequately, and create original work under the actual conditions of the job.
What human compute is, and is not
Human compute is not IQ. It is not hours worked. It is not output volume. It is not how fast someone responds on Slack. It is not how many agent workflows one person can supervise before the company starts congratulating itself on leverage while quietly destroying judgement.
A better definition is this:
Human compute is the available capacity of a person or team to process complexity, regulate under pressure, make decisions, collaborate, recover, and create.
That definition is deliberately broad, because the relevant capacity is not purely cognitive. A person can be intellectually brilliant and physiologically overloaded. A team can be full of high-IQ people and still unable to disagree cleanly. A founder can make correct arguments while transmitting enough threat that nobody brings them bad news. A research group can generate enormous output while losing the deeper states of attention in which original insight usually appears.
The mistake is to treat cognition as if it floats above the body and the group. It does not. Working memory, attention, cognitive flexibility, creativity, and moral judgement are all affected by stress, sleep, threat, trust, and recovery. This is not a romantic claim. It is one of the more ordinary facts about human beings.
A useful human-compute eval would therefore have to measure more than workload. It would need to include at least five layers.
The first is cognitive load: how much a person or team is holding, how often they are context-switching, how many decisions they are responsible for, how many active workstreams they are integrating, and how much deep work remains possible.
The second is physiological regulation: sleep, recovery, heart-rate variability, resting activation, subjective energy, and the ability to return to baseline after intensity.
The third is emotional state: irritability, threat activation, confidence, anxiety, urgency, shame, meaning, and the difference between disciplined intensity and disguised overactivation.
The fourth is relational capacity: trust, repair, psychological safety, quality of disagreement, candour, escalation, and whether people can challenge each other without turning the work into a status contest.
The fifth is creative capacity: access to flow, originality, synthesis, curiosity, tolerance for ambiguity, and the ability to stay with a problem long enough for something non-obvious to emerge.
The reason to put these in the same frame is that they interact. A sleep-deprived leader has less cognitive flexibility. A threat-activated engineer is less likely to surface bad news early. A team with low trust uses more cognitive load managing each other than solving the problem. A founder who cannot recover turns every ambiguous decision into an emergency. A researcher who spends the day triaging AI outputs may lose the absorbed-attention state that made them valuable in the first place.
The dashboard, if there is one, should not pretend these are separate variables.
How it might be measured
The simplest version would combine three sources of data: self-report, behavioural workflow data, and optional biofeedback.
Self-report is still underrated when it is done well. I do not mean long employee surveys that disappear into the people team and return three months later as a laminated insight. I mean short, lightweight, high-trust check-ins that ask questions close to the work:
How much cognitive load are you carrying this week? How clear are your priorities? How much recovery did you get? How many decisions are you holding open? How much of your work is deep work versus supervisory review? Where are you pretending to have capacity you do not actually have? Where is the system asking you to integrate too much?
These are not therapeutic questions. They are operational ones.
Behavioural workflow data can add a second layer: meeting load, message volume, after-hours activity, number of active projects, number of agent workflows supervised, decision latency, rework rate, escalation frequency, reopening of decisions, and the ratio of deep work to reactive work. None of these variables is meaningful in isolation. A high message volume may be fine in one team and catastrophic in another. The point is to identify patterns over time and mismatches between apparent output and actual capacity.
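To make "patterns over time" concrete, here is a minimal sketch of what a pattern check over those workflow signals could look like. The field names, the 1.5-standard-deviation threshold, and the rule that at least three signals must drift together are all illustrative assumptions rather than a proposed standard; the only idea the sketch commits to is that no single metric is allowed to decide anything on its own.

```python
from dataclasses import dataclass
from statistics import mean, stdev

# Hypothetical weekly signals for one team. Names, units, and thresholds
# are placeholders for illustration, not a proposed standard.
@dataclass
class WeeklySignals:
    meeting_hours: float
    after_hours_messages: int
    active_agent_workflows: int
    decision_reopen_rate: float   # share of "closed" decisions later reopened
    deep_work_ratio: float        # deep work hours / total working hours

def z_score(history: list[float], current: float) -> float:
    """How far the current value sits from this team's own baseline."""
    if len(history) < 4 or stdev(history) == 0:
        return 0.0
    return (current - mean(history)) / stdev(history)

def drift_flags(history: list[WeeklySignals], current: WeeklySignals) -> list[str]:
    """Flag only when several signals drift together; no single metric decides."""
    rising = {
        "meeting_hours": [w.meeting_hours for w in history],
        "after_hours_messages": [w.after_hours_messages for w in history],
        "active_agent_workflows": [w.active_agent_workflows for w in history],
        "decision_reopen_rate": [w.decision_reopen_rate for w in history],
    }
    drifting = [name for name, hist in rising.items()
                if z_score(hist, getattr(current, name)) > 1.5]
    # Deep work falling away from baseline counts as drift too.
    if z_score([w.deep_work_ratio for w in history], current.deep_work_ratio) < -1.5:
        drifting.append("deep_work_ratio")
    # Surface a pattern, never a verdict on an individual.
    return drifting if len(drifting) >= 3 else []
```

The output of something like this would go back to the team as a prompt for conversation, not upward as a verdict.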
Biofeedback, if used at all, should be optional, aggregated carefully, and governed with extreme restraint. Heart-rate variability, sleep, resting heart rate, respiration, and recovery data can be useful, but they become corrosive the moment employees feel monitored rather than supported. The ethical boundary is simple: physiological data should belong to the person first, and the organisation should only receive what is necessary to redesign work, not to evaluate individual worth.
This is where most companies would be tempted to ruin the idea.
The goal is not to build a productivity panopticon. The goal is to give people and teams enough signal to notice when they are overdrawn before the cost appears as attrition, conflict, safety failure, or bad judgement.
A useful human-compute eval would have to be designed around trust. Employees would need to know what is measured, why it is measured, who sees it, what decisions can and cannot be made from it, and how the data helps them rather than exposing them. Without that, the eval itself becomes another source of load.
The organisational nervous system
The most useful level of measurement is probably not the individual. It is the team.
Individuals matter, of course. But the more interesting question is whether the team has enough collective capacity to do the work. Is it surfacing bad news early? Is it repairing after conflict? Are decisions being made out of clarity or out of exhaustion? Is the founder acting as an integrator of last resort for too many unresolved tensions? Are people escalating because something matters, or because the system has not clarified ownership? Are agents reducing cognitive load, or multiplying it invisibly?
This is why I prefer the phrase Human Systems Performance Eval to Human Compute Eval, even though the second is more provocative. The object of measurement is not the isolated human. It is the human system.
A frontier AI company is not only a set of individuals. It is an organisational nervous system: attention, threat, trust, information, recovery, meaning, and decision-making moving through a network of people and tools. When that nervous system is regulated, the company can move quickly without becoming stupid. When it is dysregulated, speed turns into noise.
The best eval would therefore function less like a score and more like an early-warning system.
It would help a founder notice that the senior team is making decisions faster but reopening more of them. It would help a research lead notice that their team is producing more summaries but generating fewer original insights. It would help an operations lead notice that agent workflows have reduced task completion time while increasing review burden on two key people. It would help a company notice that its most capable employees are not underworked or disengaged, but overloaded in ways that look like competence until they leave.
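One way to give that early-warning function some teeth is to pair every apparent gain with its companion cost, and only call the gain leverage when the cost signal holds. A minimal sketch, with hypothetical signal names and thresholds:

```python
def early_warnings(changes: dict[str, float]) -> list[str]:
    """changes: fractional change versus the team's own baseline (positive = increase).
    Keys and thresholds are illustrative assumptions; the pairings are the point."""
    warnings = []
    # Deciding faster only counts if decisions stay decided.
    if changes.get("decision_latency", 0.0) < 0 and changes.get("decisions_reopened", 0.0) > 0.10:
        warnings.append("Deciding faster, but reopening more decisions.")
    # Agent speed-ups only count if review burden is not quietly growing.
    if changes.get("task_completion_time", 0.0) < 0 and changes.get("review_hours", 0.0) > 0.15:
        warnings.append("Tasks finish sooner, but supervisory review is growing.")
    # More output only counts if original work has not thinned out.
    if changes.get("output_volume", 0.0) > 0 and changes.get("original_work_hours", 0.0) < -0.10:
        warnings.append("Producing more, but spending less time on original work.")
    return warnings
```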
That is the bet: not that we can measure human beings perfectly, but that we can measure enough of the right things to stop pretending human capacity is infinite.
What founders should do with it
The first implication is that cognitive load belongs on the same operational map as compute, runway, latency, velocity, and headcount. Not because it is more important than those things, but because it determines whether those things convert into intelligent action.
The second implication is that agent rollout should include human-capacity review. Before adding a new agentic workflow, ask: who reviews the output, who carries the escalation load, who resolves ambiguity, who wakes up to the overnight work, and what recovery surface is being removed? If the workflow saves ten hours of execution but creates fifteen hours of supervisory vigilance across three senior people, it has not created leverage. It has created disguised load.
The third implication is that teams need capacity budgets. Not in the bureaucratic sense of permissioning every task, but in the simple sense that a team should know when it is exceeding its ability to integrate. A capacity budget might include active priorities, meeting load, decision volume, agent workflows, emotional intensity, recovery debt, and the amount of deep work still possible. The question is not whether people are busy. The question is whether the system is still capable of clear thought.
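To show how lightweight a capacity budget could be, here is a sketch of one as a plain record a team reviews weekly. Every limit is a placeholder the team would set for itself, and the harder-to-quantify dimensions, such as emotional intensity, deliberately stay in the check-in conversation rather than in the numbers.

```python
from dataclasses import dataclass

# Illustrative limits only; each team would choose its own.
@dataclass
class TeamCapacityBudget:
    max_active_priorities: int = 3
    max_meeting_hours: float = 12.0
    max_open_decisions: int = 10
    max_agent_workflows: int = 5
    min_deep_work_hours: float = 8.0   # per person per week
    max_recovery_debt_days: int = 2    # days since the last genuinely clear day

@dataclass
class TeamWeek:
    active_priorities: int
    meeting_hours: float
    open_decisions: int
    agent_workflows: int
    deep_work_hours: float
    recovery_debt_days: int

def over_budget(week: TeamWeek, budget: TeamCapacityBudget) -> list[str]:
    """Name what is exceeded; the team decides what to drop, not the dashboard."""
    exceeded = []
    if week.active_priorities > budget.max_active_priorities:
        exceeded.append("too many active priorities")
    if week.meeting_hours > budget.max_meeting_hours:
        exceeded.append("meeting load above budget")
    if week.open_decisions > budget.max_open_decisions:
        exceeded.append("too many decisions held open")
    if week.agent_workflows > budget.max_agent_workflows:
        exceeded.append("too many agent workflows to supervise")
    if week.deep_work_hours < budget.min_deep_work_hours:
        exceeded.append("deep work below the floor")
    if week.recovery_debt_days > budget.max_recovery_debt_days:
        exceeded.append("recovery debt accumulating")
    return exceeded
```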
The fourth implication is that leaders need to be measured, at least privately, on regulatory impact. Some leaders make their teams smarter. Others make their teams devote half their intelligence to managing the leader's state. This is usually the most expensive hidden variable in the company. It is also one of the hardest to name without a mature psychological function in the room.
The fifth implication is that the company should treat recovery as infrastructure. If human compute matters, recovery is not a perk. It is the maintenance schedule. You would not run technical infrastructure indefinitely without cooling, monitoring, and repair. You cannot run human infrastructure that way either, except that when it fails, it fails as judgement, trust, creativity, and senior retention.
The bet
I want to close carefully, because this idea is easy to misread.
The strongest version of the argument is not that humans should be managed like machines. It is almost the opposite. The argument is that humans are the non-machine part of the system, and that this makes their state more important, not less. As AI systems become faster, more persistent, and more productive, the human layer becomes the place where meaning, ethics, judgement, creativity, trust, and accountability have to live.
That layer is finite.
Companies that ignore this will continue to confuse output with intelligence. They will ship more, message more, decide more, automate more, and then wonder why their senior people become brittle, why bad news arrives late, why creativity thins out, and why the company feels strangely less clear as it becomes more productive.
Companies that take it seriously will look different. They will ask not only what their models can do, but what their humans can still integrate. They will design agentic workflows around the limits of attention and recovery. They will treat trust and repair as load-bearing infrastructure. They will measure enough to protect capacity without reducing people to metrics.
An eval for human compute will never be as clean as an eval for model performance. That is not a weakness of the concept. It is the point. Human beings are messier, more contextual, more relational, and more state-dependent than machines. Any measurement system that forgets that will become part of the problem.
The useful version of the idea is more humble and more serious: build a language, a set of signals, and a rhythm of review that helps the organisation notice when the human substrate is overdrawn.
In the age of agentic AI, the companies that win will not be the ones that ask humans to keep up with machines. They will be the ones that design systems where machines expand human capacity without quietly consuming it.
That distinction — between augmenting human compute and exhausting it — may become one of the more important design questions of the next decade.