Military adoption of artificial intelligence and autonomy is no longer hypothetical. The Department of Defense, in its updated Autonomy in Weapon Systems directive (DoD Directive 3000.09), insists that autonomous and semi-autonomous weapon systems be designed to allow commanders and operators to exercise appropriate levels of human judgment over the use of force. This policy crystallizes a practical truth: safety in military AI is not principally solved inside silicon but inside human institutions and human cognition.

Policy statements are necessary but insufficient. The DoD’s five ethical principles for AI — responsible, equitable, traceable, reliable, and governable — and the Responsible AI Strategy articulate what ought to guide development and acquisition. These frameworks make an important conceptual move. They shift attention from abstract moralizing about machines to concrete engineering, testing, and governance practices that bind designers, program managers, testers, and commanders into a chain of responsibility. Yet principles do not automatically alter human behavior or system performance on a contested battlefield. Implementation is the stubborn center of the problem.

Human factors research gives us a blunt set of warnings. When people supervise automation they suffer predictable degradations in skill, situation awareness, and vigilance. Classic findings on the out-of-the-loop problem, automation bias, and complacency show that operators who rarely intervene become poor at recognizing when intervention is needed and at executing the control tasks they have ceded to machines. In other words, handing authority to an algorithm changes the human role in ways that can erode the very safeguards we expect humans to provide. These are not metaphors; they are experimentally observed phenomena across domains with high stakes.

The military context amplifies those human factors. Combat environments are noisy, morally fraught, and adversarial. Automation will be stressed by degraded sensors, spoofing, adversary deception, and rare edge cases that often lie outside training distributions. In these conditions, an overreliance on brittle models or a failure to calibrate operator trust can produce catastrophic outcomes. Human-on-the-loop supervision is attractive because it promises speed with oversight, but supervision without continuous engagement is an illusion of control. The relevant literature therefore emphasizes adaptive strategies that change the level of autonomy in response to assessed operator state and system confidence.

From these facts follows a practical taxonomy of risks that human factors must address if safety is to be real rather than rhetorical. First, loss of situational awareness and skill atrophy: if operators do not practice decisions they must be prepared to make, their ability to step in collapses. Second, miscalibrated trust: both overtrust and undertrust are dangerous because they respectively produce complacency or disuse. Third, automation brittleness and surprise: AI systems will encounter unanticipated contexts that produce inscrutable failures. Fourth, accountability and legal traceability failures: when humans cannot understand or reconstruct how a system decided, responsibility becomes opaque and governance weak. Each risk is addressable only by a mixture of engineering, training, doctrine, and evaluation. Several of those levers are already captured in national-level strategies but require amplification and sustained resourcing.

What does good practice look like? First, design for calibrated cooperation rather than sealed-box automation that quietly bypasses the operator. Systems must expose uncertainty, alternatives considered, and confidence metrics in ways that are meaningful to operators. Explainability is not an academic checkbox. It is a human factors requirement that supports correct mental models and faster recovery when automation is surprised. Second, adopt adaptive autonomy: the system should lower its autonomy and increase human involvement when its model confidence or environmental match falls beneath validated thresholds. Adaptive schemes preserve human engagement without sacrificing the speed benefits of automation where appropriate.
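The adaptive-autonomy idea implies a fairly simple control policy. The sketch below is a minimal illustration in Python; the mode names, state fields, and thresholds (CONFIDENCE_FLOOR, ENVELOPE_FLOOR, WORKLOAD_CEILING) are hypothetical placeholders, and validated values would have to come from test and evaluation rather than from code review.

```python
from dataclasses import dataclass
from enum import Enum


class AutonomyLevel(Enum):
    """Coarse autonomy modes; a real system would define these in doctrine."""
    FULL_AUTO = 3        # system acts, operator monitors
    HUMAN_ON_LOOP = 2    # system proposes, operator can veto within a window
    HUMAN_IN_LOOP = 1    # operator must approve every action


@dataclass
class SystemState:
    model_confidence: float    # calibrated confidence in the current decision, 0..1
    distribution_match: float  # similarity of current inputs to the validated envelope, 0..1
    operator_workload: float   # estimated operator workload, 0 (idle) .. 1 (saturated)


# Placeholder thresholds; in practice these come from validated test data.
CONFIDENCE_FLOOR = 0.90
ENVELOPE_FLOOR = 0.80
WORKLOAD_CEILING = 0.75


def select_autonomy_level(state: SystemState) -> AutonomyLevel:
    """Drop to a more human-involved mode whenever confidence or envelope match degrades."""
    if state.model_confidence < CONFIDENCE_FLOOR or state.distribution_match < ENVELOPE_FLOOR:
        # Out of the validated envelope or uncertain: require explicit human approval.
        return AutonomyLevel.HUMAN_IN_LOOP
    if state.operator_workload > WORKLOAD_CEILING:
        # Operator is saturated: keep them on the loop rather than fully handing off.
        return AutonomyLevel.HUMAN_ON_LOOP
    return AutonomyLevel.FULL_AUTO
```

The design choice worth noting is the one-way ratchet: the machine can demote itself, but only a human action should restore a higher level of autonomy, which forces re-engagement at exactly the moments the human factors literature says engagement erodes.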

Third, embed rigorous human-systems integration in acquisition. Safety cannot be inspected in after a system has been fielded. Requirements, test plans, simulation envelopes, red-team adversarial scenarios, and verified human-in-the-loop exercises must be preconditions for escalation to operational use. The DoD’s Responsible AI pathways intend this, but contracting practices, incentives, and test infrastructure must align with that intent across program offices. Traceable engineering artifacts that capture dataset provenance, model training regimes, hypothesis tests, and failure modes are essential to post-incident learning and legal accountability.
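One way to make "traceable engineering artifacts" concrete is to treat them as structured records rather than narrative buried in a report. The schema below is a hypothetical minimal sketch; the field names are illustrative and do not correspond to any existing DoD artifact standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetProvenance:
    name: str
    version: str
    collection_period: str        # e.g. "2023-01 to 2023-06"
    known_gaps: List[str]         # conditions under-represented in the data


@dataclass
class KnownFailureMode:
    description: str              # e.g. "confuses decoys with vehicles in heavy rain"
    observed_rate: float          # rate in the test condition where it was measured
    mitigations: List[str]        # procedural or technical mitigations in place


@dataclass
class ModelReleaseRecord:
    """Minimal provenance record accompanying a fielded model version.

    Intended to support post-incident reconstruction: what data, what training
    regime, what tests, and what failure modes were known at release time.
    """
    model_id: str
    training_datasets: List[DatasetProvenance]
    training_config_hash: str       # hash of the exact training configuration used
    evaluation_reports: List[str]   # references to test and red-team reports
    known_failure_modes: List[KnownFailureMode] = field(default_factory=list)
    approved_operating_envelope: str = ""  # conditions the safety case actually covers
```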

Fourth, invest in persistent training and exercises that keep human skills fresh. Training should not be limited to occasional certification runs. It should simulate degraded autonomy, ambiguous sensor returns, and adversary deception at scale. These are not solely technical drills; they are exercises in moral and cognitive resilience. People must practice the art of reclaiming control under stress, with the muscle memory to act correctly and swiftly. The National Academies and other policy bodies identify this human-AI teaming competence as a research and workforce priority.

Fifth, measure human performance with real metrics. Certification should include human throughput measures, detection of automation failures, time-to-intervention, accuracy of supervisory decisions under varying workloads and stressors, and susceptibility to automation bias. These metrics must be part of the safety case for any system whose operational employment could cause harm. A safety case that omits human performance data is not a safety case; it is a marketing pitch.
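To indicate how such metrics might be computed in practice, here is a minimal sketch over logged human-in-the-loop trials in which automation failures are deliberately injected. The trial record format and field names are invented for the example, not drawn from any existing standard.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class SupervisionTrial:
    """One logged trial in which the automation failed and the operator should intervene."""
    failure_onset_s: float                # when the injected automation failure began
    intervention_time_s: Optional[float]  # when the operator intervened, None if never
    intervention_correct: bool            # whether the intervention was the right action


def failure_detection_rate(trials: List[SupervisionTrial]) -> float:
    """Fraction of injected failures the operator intervened on at all."""
    detected = sum(1 for t in trials if t.intervention_time_s is not None)
    return detected / len(trials)


def median_time_to_intervention(trials: List[SupervisionTrial]) -> Optional[float]:
    """Median seconds from failure onset to intervention, over detected failures only."""
    latencies = [
        t.intervention_time_s - t.failure_onset_s
        for t in trials
        if t.intervention_time_s is not None
    ]
    return median(latencies) if latencies else None


def correct_intervention_rate(trials: List[SupervisionTrial]) -> float:
    """Fraction of all injected failures that were both detected and handled correctly."""
    correct = sum(
        1 for t in trials if t.intervention_time_s is not None and t.intervention_correct
    )
    return correct / len(trials)
```

Reported across varying workloads and stressors, these numbers become the human half of the safety case rather than an afterthought.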

Sixth, design governance that enforces governability. Human judgment is not a checkbox. It is an organizational capability that must be resourced, audited, and accountable. Architectural provisions to disengage, degrade gracefully, and preserve forensic logs are necessary but not sufficient. Policies must be clear about lines of authority and legal obligations in the chain from developer to field commander. Robust red-team campaigns and independent assurance bodies should be institutionalized to guard against optimistic assumptions baked into procurement decisions.
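As one illustration of the architectural side of governability, consider a hypothetical sketch of a controller that records every recommendation and human decision to an append-only, hash-chained log and exposes an unconditional disengage path. Nothing here reflects an actual fielded architecture; it only shows the kind of provisions the paragraph describes, with all class and method names invented for the example.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field
from typing import List


@dataclass
class ForensicLog:
    """Append-only, hash-chained event log to support post-incident reconstruction."""
    entries: List[dict] = field(default_factory=list)

    def append(self, event_type: str, detail: dict) -> None:
        prev_hash = self.entries[-1]["hash"] if self.entries else "genesis"
        body = {
            "timestamp": time.time(),
            "event_type": event_type,
            "detail": detail,
            "prev_hash": prev_hash,  # chaining makes after-the-fact tampering detectable
        }
        body["hash"] = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        self.entries.append(body)


class EngagementController:
    """Logs every recommendation and human decision; disengagement always wins."""

    def __init__(self, log: ForensicLog):
        self.log = log
        self.engaged = False

    def recommend(self, target_id: str, confidence: float) -> None:
        self.log.append("recommendation", {"target_id": target_id, "confidence": confidence})

    def human_approve(self, target_id: str, operator_id: str) -> None:
        self.engaged = True
        self.log.append("human_approval", {"target_id": target_id, "operator": operator_id})

    def disengage(self, reason: str, operator_id: str) -> None:
        # Disengagement is unconditional: no model state can veto it.
        self.engaged = False
        self.log.append("disengage", {"reason": reason, "operator": operator_id})
```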

Finally, accept that some limits are epistemic. There will be classes of problems where current AI cannot provide reliable discrimination under realistic operational constraints. The respectable response is humility: do not weaponize classifiers where error imposes unacceptable moral or strategic costs. The more seductive the capability, the more strenuous should be the burden of proof. Ethical principles must translate to operational stoplights that prevent deployment until human factors and system reliability criteria are met.

Philosophically speaking, the safety of military AI hinges on a simple proposition. Machines can process and optimize across large data landscapes. Humans supply moral sense, prudential judgment, and responsibility under uncertainty. Safety emerges when the system architecture, doctrine, and training preserve the conditions for those human qualities to manifest when they are needed. If automation is introduced as a substitute for judgment rather than as its amplifier, the technology will hollow out agency and increase rather than reduce risk. If, instead, we design institutions that treat automation as an extension of human capacity with strict admission criteria, transparent behavior, and continual human engagement, then gains in tempo and accuracy can be translated into real reductions in avoidable harm.

Concluding recommendations for program managers and policy makers are modest and concrete. Enshrine human performance metrics in certification. Require adaptive autonomy and uncertainty reporting as part of operational releases. Fund synthetic training and regular red-team adversarial testing. Preserve traceable engineering artifacts for legal and forensic review. And finally, resist the rhetorical shortcut that a human “on the loop” is by itself sufficient assurance of safety. Human judgment must be preserved in practice, not only in policy. The future of safer military AI is primarily an exercise in preserving the quality of human judgment under pressure, and in engineering systems so that human beings remain capable, informed, and ready to act when the black box inevitably fails.