We convene often to say the obvious. Machines can calculate at scales and speeds beyond human capacity. Humans still retain moral imagination and responsibility, and they still bear the ultimate cost of violence. The recent surge of workshops, panels, and policy fora has made one theme unavoidable: human judgment must be designed into military AI systems rather than assumed to be an external addendum.
If "human oversight" is to be more than a rhetorical comfort, it requires three concrete commitments. First, specify roles and authorities with surgical clarity. It is not enough to declare that a human is "in the loop" or "on the loop." Operational doctrine must codify who may veto, who may delegate, and what metrics trigger forced human review. NIST's AI Risk Management Framework explicitly points to human factors and oversight as measurable elements of risk governance, and it urges organizations to define processes for operator proficiency and human oversight across the lifecycle.
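As a minimal sketch of what such codification could mean in practice, the fragment below records who may veto, who may delegate, and which metrics force human review. The role names, metric names, and thresholds are invented for illustration; they are not drawn from NIST material or from any doctrine.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class OversightAuthority:
    """One role's codified authorities over an AI-enabled system (illustrative)."""
    role: str                  # e.g. "mission_commander" (hypothetical)
    may_veto: bool             # can halt a machine recommendation outright
    may_delegate: bool         # can hand authority to a subordinate role
    forced_review_metrics: dict = field(default_factory=dict)  # metric name -> upper limit

def forced_review_required(entry: OversightAuthority, metrics: dict) -> bool:
    """True if any codified metric breaches its limit, forcing human review."""
    for name, limit in entry.forced_review_metrics.items():
        value = metrics.get(name)
        if value is None or value > limit:
            return True   # missing or out-of-bounds evidence goes to a human
    return False

commander = OversightAuthority(
    role="mission_commander", may_veto=True, may_delegate=True,
    forced_review_metrics={"civilian_presence_prob": 0.05, "novelty_score": 0.20},
)
print(forced_review_required(commander, {"civilian_presence_prob": 0.12, "novelty_score": 0.1}))  # True
```

The particular fields matter less than the fact that the authorities are written down, testable, and inspectable rather than implied.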
Second, accept that oversight is socio-technical and organizational. Workshops convened by the National Academies and related forums have emphasized that human oversight fails or succeeds in the context of culture, training, and incentives. Human operators will only exercise corrective judgment when institutions support that role through training, time, authority, and accountability. Conversely, poor organizational design converts nominal oversight into a ritual with no bite.
Third, design autonomy with calibrated delegation rather than binary choices. Recent academic work on dynamic delegation and human-AI complementarity shows that optimal configurations shift with task difficulty, environmental novelty, and the asymmetric costs of error. Systems that can adaptively defer to human judgment for high-uncertainty cases while acting autonomously on routine, well-specified tasks better preserve both effectiveness and moral control. These are not abstract papers for conference proceedings. They give engineers and commanders a vocabulary for trading speed against responsibility.
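To fix ideas, here is one minimal sketch of such a deferral rule, assuming the system can report a calibrated confidence and a novelty score and that error costs have been specified in advance. Every name and threshold is illustrative, not taken from the literature cited above.

```python
def delegate_or_defer(confidence: float,
                      novelty_score: float,
                      cost_false_positive: float,
                      cost_false_negative: float,
                      confidence_floor: float = 0.95,
                      novelty_ceiling: float = 0.20,
                      cost_asymmetry_limit: float = 10.0) -> str:
    """Return 'act_autonomously' or 'defer_to_human' (illustrative decision rule).

    The system defers when it is unsure, when the situation looks unlike its
    training conditions, or when the two error types have sharply unequal costs.
    """
    asymmetry = max(cost_false_positive, cost_false_negative) / max(
        min(cost_false_positive, cost_false_negative), 1e-9)
    if (confidence < confidence_floor
            or novelty_score > novelty_ceiling
            or asymmetry > cost_asymmetry_limit):
        return "defer_to_human"
    return "act_autonomously"

# Routine, well-specified task: high confidence, familiar context, symmetric costs.
print(delegate_or_defer(0.99, 0.05, 1.0, 1.5))    # act_autonomously
# Novel, high-stakes case: the decision is handed back to a person.
print(delegate_or_defer(0.80, 0.40, 1.0, 100.0))  # defer_to_human
```

The point is not the particular numbers but that the hand-off condition is explicit, testable in exercises, and auditable after the fact.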
Across multiple international conversations, including UN and multistakeholder gatherings, three recurrent fault lines emerged. One is definitional: what do we mean by "meaningful human control" in systems that learn and change post-deployment? Another is legal and moral accountability: who answers when an algorithmic decision harms civilians or causes escalation? The third is practical: under combat stress and compressed timelines, can human oversight remain timely and informed? These are not separate problems. How we resolve the definitional question shapes legal accountability, which in turn determines what is operationally feasible.
From a policy perspective, the way forward should be modest and technical, not grandiose and vague. I propose four policy levers that emerged repeatedly from recent workshops and briefings:
1) Mandate auditable intent traces. Every machine recommendation that could foreseeably lead to lethal effects should carry a recorded provenance: model version, confidence metrics, training-data provenance, and the decision rule used to escalate or defer to a human agent. Such trails make post hoc accountability practicable and improve operator situational awareness. (An illustrative sketch of such a record follows this list.)
2) Require operational red-teaming and human-in-the-loop exercises as part of certification. Tabletop ethics discussions are necessary but insufficient. Units must practice with degraded sensors, adversarial inputs, and time pressure until the human-machine choreography either proves robust or is shown to require redesign. The National Academies and government workshops have highlighted the value of realistic testing that includes organizational human factors.
3) Define risk tolerances and delegation thresholds in advance. Commanders cannot improvise a moral calculus on the battlefield. Organizations should codify tolerances for false positives, false negatives, and permissible autonomous actions tied to mission type and legal constraints. NIST's roadmap explicitly recommends methods to develop reasonable risk tolerances and governance structures that allocate authority. (An illustrative sketch of such pre-declared tolerances also follows this list.)
4) Invest in the human side of the equation. Oversight is not a box to check. It is a role that requires training, psychological preparation, and institutional protection. If human agents are to intervene against a machine recommendation, the chain of command must protect them from perverse incentives and career penalties for slowing operations. Workshops from academic, international, and defense communities have consistently raised this point.
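To illustrate the first lever, here is a minimal sketch of what an auditable intent trace could contain, assuming an append-only log lives elsewhere; the field names and values are placeholders, not a proposed standard.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional

@dataclass
class IntentTrace:
    """One auditable record per machine recommendation (field names are illustrative)."""
    model_version: str
    training_data_digest: str      # provenance pointer, e.g. a dataset-manifest hash
    confidence: float
    decision_rule: str             # the codified rule that escalated or deferred
    deferred_to_human: bool
    operator_id: Optional[str]
    timestamp_utc: str

def record_trace(trace: IntentTrace) -> str:
    """Serialize the trace and return a content hash for tamper-evident logging."""
    payload = json.dumps(asdict(trace), sort_keys=True)
    # In practice this would be written to an append-only store; here we only hash it.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

trace = IntentTrace(
    model_version="classifier-2.3.1",                  # placeholder
    training_data_digest="sha256:<manifest-digest>",   # placeholder
    confidence=0.87,
    decision_rule="defer_if_confidence<0.95",
    deferred_to_human=True,
    operator_id="op-214",
    timestamp_utc=datetime.now(timezone.utc).isoformat(),
)
print(record_trace(trace))
```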
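And to illustrate the third lever, a pre-declared tolerance table might look like the sketch below; the mission names, action lists, and rates are invented for the example and would in reality come from legal review and doctrine.

```python
# Hypothetical pre-declared tolerances, set before deployment rather than
# improvised under fire; mission names, actions, and rates are illustrative only.
RISK_TOLERANCES = {
    "reconnaissance": {
        "max_false_positive_rate": 0.10,
        "max_false_negative_rate": 0.02,
        "autonomous_actions": ["track", "report"],
    },
    "point_defence": {
        "max_false_positive_rate": 0.01,
        "max_false_negative_rate": 0.05,
        "autonomous_actions": ["track", "report", "jam"],
    },
}

def action_permitted(mission: str, action: str, observed_fp_rate: float) -> bool:
    """Permit an autonomous action only if it is on the pre-declared list for the
    mission type and the system's measured false-positive rate is within tolerance."""
    policy = RISK_TOLERANCES.get(mission)
    if policy is None:
        return False   # unknown mission type: nothing runs autonomously
    return (action in policy["autonomous_actions"]
            and observed_fp_rate <= policy["max_false_positive_rate"])

print(action_permitted("reconnaissance", "track", observed_fp_rate=0.08))  # True
print(action_permitted("reconnaissance", "jam", observed_fp_rate=0.08))    # False
```

The value of writing the table down is that the tolerances are argued over in advance, in daylight, rather than under fire.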
Technical limits must be named candidly. Contemporary AI systems are brittle in the face of distributional shift, adversarial manipulation, and novel tactical contexts. There is therefore no technological escape from the need for human judgment; there are only design choices and tradeoffs, and those tradeoffs must be made explicit. Treat human oversight as an engineering requirement, not a legal fig leaf.
Finally, human oversight must be pluralistic and multinational. No single doctrine will satisfy all legal, moral, and strategic communities. That is why the UN, multilateral commissions, standards bodies, and academic consortia have a central role in shaping norms that reconcile operational needs with humanitarian constraints. International dialogue has moved from abstract admonitions toward concrete proposals on lifecycle governance, auditable systems, and operational testing. Those are the conversations worth investing in.
If there is a moral to these convenings, it is simple and unsettling. Machines will change who decides, but they must not be permitted to decide who bears responsibility. To insist otherwise is to trade moral agency for efficiency and then wonder why responsibility evaporates. Those who design, deploy, and command AI-enabled systems carry, and must accept, that burden. Our task is not to remove humans from the loop. It is to redesign systems so the human in the loop can actually see, judge, and be accountable.