Simulations sponsored and run by the U.S. Department of Defense have become one of the most revealing laboratories for studying human-robot trust. Unlike armchair argument or single-case field tests, DoD sims compress many repetitions of interaction, failure, and repair into controlled sequences that produce data both rich and messy. That data is useful for engineers and ethicists alike because it lets researchers separate two quantities that are commonly conflated: objective capability and subjective trust.
Two illustrative examples frame the current state of evidence. First, the AlphaDogfight/ACE sequence of simulations showed that autonomous agents can outperform highly trained pilots in tightly bounded aerial engagements. Those simulated encounters were not offered as proof that autonomy can replace human judgment. They were used instead to probe how human operators perceive an algorithm that demonstrably outperforms them, and what kinds of interface, explanation, and supervisory roles are required before trust becomes operational rather than rhetorical. The trials were explicit that their goal was to build human trust through repeatable, measurable tasks.
Second, large swarm and urban-environment simulations under the OFFSET banner explored a different trust problem. In those experiments a small number of human controllers supervised dozens to hundreds of simple agents within a virtual or reconstructed environment. The emergent finding was familiar and important: human operators can manage more autonomy at scale when the autonomy is predictable and when the interface communicates intent at the right level of granularity. Predictability and intent communication support a form of epistemic trust that is distinct from the robot’s raw performance figures. In practice that means a swarm can be technically competent yet mistrusted if it behaves opaquely.
Complementing these program-level simulations are smaller empirical studies that probe the microdynamics of trust. Experimental work in human-subject settings shows that trust falls precipitously after a performance violation and that explicit strategies for uncertainty communication and apology materially affect recovery. In other words, designers must assume failures will occur and must bake in protocols that allow a team to recover. Simulations allow us to run those failure modes systematically and to quantify both immediate and longer-term trust trajectories.
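To make the shape of such a trajectory concrete, here is a minimal sketch in Python that treats trust as a value between 0 and 1: it climbs slowly with each success, drops sharply after a performance violation, and recovers somewhat faster when the violation is followed by an explicit repair behavior. The update rule, function names, and parameter values are illustrative assumptions, not figures drawn from any particular DoD study.

```python
# Illustrative trust-trajectory model for a simulated human-robot team.
# The asymmetric update rule (slow gain, sharp loss) and the parameter
# values are assumptions chosen for demonstration only.

def update_trust(trust, success, repair=False,
                 gain=0.05, violation_penalty=0.35, repair_bonus=0.10):
    """Update a trust estimate in [0, 1] after one interaction.

    success: True if the robot performed as expected, False on a violation.
    repair:  True if the violation was followed by an apology or an
             explicit uncertainty statement.
    """
    if success:
        # Gradual confirmation: each success closes part of the remaining gap.
        trust += gain * (1.0 - trust)
    else:
        # Precipitous drop after a violation...
        trust -= violation_penalty * trust
        if repair:
            # ...partially offset by a calibrated repair behavior.
            trust += repair_bonus * (1.0 - trust)
    return min(max(trust, 0.0), 1.0)


def simulate(events, start=0.7):
    """Run a sequence of (success, repair) events and return the trajectory."""
    trajectory = [start]
    for success, repair in events:
        trajectory.append(update_trust(trajectory[-1], success, repair))
    return trajectory


# Ten successes, one violation with repair, then ten more successes.
events = [(True, False)] * 10 + [(False, True)] + [(True, False)] * 10
print([round(t, 2) for t in simulate(events)])
```

Running the same event sequence with and without the repair flag makes the claim measurable: the post-violation trajectory with repair sits consistently above the one without, which is exactly the kind of comparison a simulation instrumented for trust repair can report.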
There is also active work on modeling how trust propagates in multi-human multi-robot teams. Computational models tested in controlled search and detection simulations show that humans form trust not only from direct experience with a specific robot but also from indirect reports and observed interactions between teammates and other robots. That propagation matters because modern formations will be heterogeneous. A single trusted agent can raise the perceived reliability of nearby agents, and conversely a single breach can contaminate trust across the team. These are not merely theoretical curiosities. They change how we should instrument simulations and how we should design reporting interfaces.
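A toy propagation model makes the mechanism easier to instrument. In the sketch below, each operator's trust in a robot is updated as a weighted blend of direct experience and the assessments reported or observed from teammates; the weights, team layout, and names are hypothetical and serve only to show how a single report can raise (or contaminate) trust across the team.

```python
# Toy trust-propagation step for a multi-human multi-robot team.
# Weights and the example team are hypothetical illustrations.

DIRECT_WEIGHT = 0.7    # weight on an operator's own experience
INDIRECT_WEIGHT = 0.3  # weight on teammates' reported assessments


def propagate(trust, reports):
    """One propagation round.

    trust:   dict mapping (operator, robot) -> current trust in [0, 1]
    reports: dict mapping (operator, robot) -> list of trust values that
             operator heard or observed from teammates about that robot
    """
    updated = {}
    for (operator, robot), own in trust.items():
        peer_values = reports.get((operator, robot), [])
        if peer_values:
            peer_mean = sum(peer_values) / len(peer_values)
            updated[(operator, robot)] = (DIRECT_WEIGHT * own
                                          + INDIRECT_WEIGHT * peer_mean)
        else:
            updated[(operator, robot)] = own  # no indirect evidence this round
    return updated


# Operator A trusts robot R1 highly; operator B has had little contact with it.
trust = {("A", "R1"): 0.9, ("B", "R1"): 0.5}
# During a shared mission, B hears A's positive assessment of R1.
reports = {("B", "R1"): [0.9]}
print(propagate(trust, reports))  # B's trust in R1 rises toward A's report
```

The same structure runs in reverse: feed in a low report after a breach and B's trust in R1 falls even though B never experienced the failure directly, which is the second-order effect that reporting interfaces need to account for.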
On the methodological side, DoD-sponsored efforts are converging on two pragmatic refinements. The first is better synthetic agents and digital twins capable of producing naturalistic human behavior at scale. Programs such as DARPA’s exploratory human-AI team modeling seek to generate realistic human simulacra so that researchers can evaluate human-AI team performance without the prohibitive cost of live-subject trials. The second refinement is multi-modal measurement. Beyond questionnaires, labs are experimenting with physiological signals, task logs, and fine-grained behavioral markers inside the simulation to infer stress, cognitive load, and reliance in near real time. Those measures let researchers correlate objective task success with subjective trust and identify mismatches where high trust masks poor calibration or where low trust suppresses high-performing autonomy.
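A post-hoc calibration check along those lines can be quite simple. The sketch below, with assumed names and an arbitrary tolerance, compares a robot's logged reliability against the operator's reported trust (or behavioral reliance rate) and flags the two mismatch cases just described.

```python
# Hypothetical calibration check comparing objective reliability from the
# simulation log with subjective trust or reliance. The 0.15 tolerance is
# an arbitrary illustrative threshold.

def calibration_flag(reliability, reported_trust, tolerance=0.15):
    """Label the trust/capability relationship for one robot and one role.

    reliability:    fraction of tasks completed correctly in the sim (0-1)
    reported_trust: operator trust rating or observed reliance rate (0-1)
    """
    gap = reported_trust - reliability
    if gap > tolerance:
        return "overtrust: reliance exceeds demonstrated capability"
    if gap < -tolerance:
        return "undertrust: capable autonomy is being underused"
    return "calibrated"


print(calibration_flag(reliability=0.62, reported_trust=0.90))  # overtrust
print(calibration_flag(reliability=0.95, reported_trust=0.55))  # undertrust
```

Multi-modal measures enter as better estimators of the reported-trust term; the comparison itself stays the same.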
What does the aggregated experimental record tell us about practice? First, trust is contextual and task-specific. A robot that is trusted to navigate rough terrain need not be trusted to make lethal targeting decisions. Trust should be measured and reported for each role the system may occupy. Second, transparency matters. Simulations show that symbolic, comprehensible explanations of intent and plan matter more to human trust than opaque signals of confidence. Third, recovery matters. Systems that communicate uncertainty and that have calibrated repair behaviors maintain higher long-term trust in repeated simulations than those that do not. These are empirical claims supported by controlled simulation data across multiple DoD programs and academic follow-ons.
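The first of those points, that trust should be measured and reported per role, suggests a reporting structure like the hypothetical one sketched below; the role names, fields, and system identifier are invented for illustration.

```python
# Sketch of per-role trust reporting. Roles, field names, and the example
# system identifier are hypothetical.

from dataclasses import dataclass, field


@dataclass
class RoleTrustRecord:
    role: str                    # e.g. "terrain navigation"
    measured_reliability: float  # objective success rate in that role (0-1)
    reported_trust: float        # operator trust rating for that role (0-1)
    n_trials: int                # simulated trials supporting the numbers


@dataclass
class SystemTrustReport:
    system_id: str
    records: list = field(default_factory=list)

    def add(self, record: RoleTrustRecord) -> None:
        self.records.append(record)

    def summary(self) -> dict:
        """Map each role to its (reliability, trust) pair."""
        return {r.role: (r.measured_reliability, r.reported_trust)
                for r in self.records}


report = SystemTrustReport("UGV-07")
report.add(RoleTrustRecord("terrain navigation", 0.93, 0.88, n_trials=120))
report.add(RoleTrustRecord("target recommendation", 0.71, 0.90, n_trials=45))
print(report.summary())
```

Keeping the roles separate is the point: the same vehicle can be well calibrated for navigation and overtrusted for target recommendation, and a single aggregate score would hide that.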
There are also clear limitations in the simulation evidence. Many DoD sims create bounded, stylized problems that omit the moral and legal ambiguity found in real combat. Simulated adversaries may lack the adaptivity of human opponents operating outside rules of engagement, and physiological proxies cannot yet capture the full social dynamics of a deployed unit. Finally, there is a risk of complacency about ecological validity: if trust is built inside sanitized sims and then carried uncritically to the field, we create brittle human-robot partnerships. A thoughtful simulation program insists on injecting adversarial variability and on validating key simulator-derived findings with limited, tightly supervised field trials.
For designers and policymakers the takeaway is direct. Use simulations to quantify both performance and perceived trust. Instrument sims for failure modes and for trust repair strategies. Model multi-agent trust propagation so that interface design can account for second-order effects. Finally, remember that trust is not an engineering checkbox. Trust is an ongoing social accomplishment that requires institutional practices, training, transparent metrics, and moral attention. The DoD simulation corpus gives us robust ways to measure trust development and degradation. Now we must use those measures to design teams where human judgment remains central while autonomy expands the feasible options on the battlefield, not the accepted ones.