We are told that autonomy will remove humans from the most dangerous tasks. The claim has merit, and autonomy is already saving lives in constrained settings. Yet when autonomous systems leave the lab and enter messy, contested, real-world environments, they routinely reveal brittle failure modes. These failures are not curiosities. They expose structural limitations in perception, decision making, testing methodology, and the social institutions that govern deployment.
High-profile accidents provide blunt evidence of that brittleness. In 2018 an Uber test vehicle struck and killed a pedestrian in Tempe, Arizona while operating under a developmental automated driving system. The federal investigation made clear that the vehicle detected the pedestrian but did not classify her correctly or brake in time, and that the vehicle's built-in emergency braking had been disabled during autonomous operation.
Regulatory and transparency data offer a broader view. Companies with permits to test autonomous vehicles in California reported driving millions of autonomous miles in the 2021 to 2022 reporting period, yet disengagements and collision reports remain routine enough to require public accounting. The California Department of Motor Vehicles published summaries showing both a rapid increase in miles and persistent, nontrivial counts of disengagements across permit holders.
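For concreteness, here is a minimal sketch of the kind of summary such reports permit: miles per disengagement computed from per-company totals. The record fields and figures are illustrative assumptions, not the DMV's actual schema or any company's real numbers.

```python
# Minimal sketch: compute a miles-per-disengagement summary from hypothetical
# per-company records. Field names and figures are illustrative and do not
# mirror the actual California DMV report schema.
from dataclasses import dataclass

@dataclass
class PermitHolderReport:
    company: str
    autonomous_miles: float   # miles driven in autonomous mode this period
    disengagements: int       # disengagement count reported for the same period

def miles_per_disengagement(report: PermitHolderReport) -> float:
    """Higher is nominally better, but the single number hides edge-case exposure."""
    if report.disengagements == 0:
        return float("inf")   # no disengagements reported in the period
    return report.autonomous_miles / report.disengagements

reports = [
    PermitHolderReport("CompanyA", 1_200_000, 150),   # hypothetical figures
    PermitHolderReport("CompanyB", 350_000, 900),
]
for r in reports:
    print(f"{r.company}: {miles_per_disengagement(r):,.0f} mi per disengagement")
```

Even this toy calculation shows why the metric needs context: a single ratio says nothing about where the miles were driven or which situations forced the handovers.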
Consumer safety agencies and investigative reporting have similarly documented repeated incidents involving partially automated driving systems. National regulators have logged hundreds of crashes involving advanced driver assistance systems in recent reporting periods, and Tesla’s Autopilot software in particular accounted for a large fraction of those recorded incidents in 2021 and 2022, triggering expanded agency probes. These data illustrate how an automated capability that works within a narrow operational design domain can still produce dangerous outcomes when used beyond that domain or when human supervision fails.
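One way to make the operational-design-domain point concrete is a gate that refuses to engage automation outside declared conditions. The conditions and thresholds below are invented for illustration and do not reflect any vendor's actual engagement logic.

```python
# Minimal sketch of an operational-design-domain (ODD) gate. The conditions and
# thresholds are invented for illustration; they are not a real product's logic.
from dataclasses import dataclass

@dataclass
class DrivingContext:
    road_type: str          # e.g. "divided_highway", "urban_street"
    visibility_m: float     # estimated visibility in meters
    speed_limit_mph: int
    map_coverage: bool      # high-definition map available for this segment

def within_odd(ctx: DrivingContext) -> bool:
    """Return True only if every declared ODD condition holds; otherwise refuse to engage."""
    return (
        ctx.road_type == "divided_highway"
        and ctx.visibility_m >= 200.0
        and ctx.speed_limit_mph <= 65
        and ctx.map_coverage
    )

ctx = DrivingContext("urban_street", 150.0, 35, False)
print("Engage automation:", within_odd(ctx))   # False: outside the declared ODD
```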
What causes these failures? At a technical level, a major culprit is the sim-to-real gap: the mismatch between the environments in which machine-learned controllers are trained and the open world in which they must operate. Simulation is indispensable because real-world data are expensive, slow, and sometimes unsafe to collect. But policies and perception models that perform well in simulation often falter under distributional shifts, sensor noise, unmodeled dynamics, occlusions, and adversarial or novel scenarios. Researchers have proposed mitigation strategies such as domain randomization and grounded adaptation, and those approaches have produced notable successes. Nonetheless they remain partial fixes rather than full solutions.
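A minimal sketch of domain randomization follows, assuming hypothetical simulator and policy interfaces: physics and sensor parameters are resampled every episode so a learned policy cannot overfit to one idealized environment.

```python
# Minimal sketch of domain randomization: resample simulator physics and sensor
# parameters every episode so a learned policy cannot overfit to a single
# idealized environment. The simulator and policy interfaces are hypothetical.
import random

def randomized_sim_params() -> dict:
    return {
        "friction":     random.uniform(0.4, 1.2),   # road/tire friction coefficient
        "sensor_noise": random.uniform(0.0, 0.05),  # lidar/camera noise std deviation
        "mass_scale":   random.uniform(0.8, 1.2),   # unmodeled payload variation
        "latency_ms":   random.uniform(10, 120),    # actuation delay
    }

def train(policy, simulator, episodes: int = 10_000):
    for _ in range(episodes):
        simulator.reset(**randomized_sim_params())  # a new domain each episode
        rollout = simulator.run(policy)
        policy.update(rollout)                      # any RL or imitation update
    return policy
```

The design choice is deliberate: the training distribution is widened so that the real world looks like just another sample from it, which helps with the shifts that were randomized but not with the ones nobody thought to model.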
Robotic competitions and public demonstrations make the problem visible in another way. The DARPA Robotics Challenge revealed, in dramatic fashion, how difficult it is to translate capability into reliability. Robots could execute individual tasks in curated trials but still fell, failed to manipulate objects robustly, or required extensive human intervention when conditions varied even modestly. The spectacle of progress was matched by the humility of repeated failure.
There is also a human-factors dimension that technologists must face. Automation changes responsibility and behavior. When a system appears to drive itself, human operators can experience mode confusion or complacency. Supervisory roles that were intended as safety backups sometimes become ineffective because they are poorly instrumented or because human attention degrades during long stretches of monitoring. Regulators have repeatedly warned that partial automation can create the illusion of full automation, producing misuse that the system designers did not and could not anticipate.
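Better instrumentation of the supervisory role can at least make degraded attention visible and auditable. The sketch below shows one possible escalation policy; the thresholds and logging format are illustrative assumptions, not a production design.

```python
# Minimal sketch of an instrumented supervision policy: the longer the human
# supervisor shows no attention signal, the further the system escalates, and
# every escalation is logged so misuse can be audited afterward. Thresholds
# and states are illustrative assumptions, not a production design.
import time

WARN_AFTER_S = 5.0       # gentle visual reminder
ALERT_AFTER_S = 10.0     # audible alert
FALLBACK_AFTER_S = 15.0  # begin a minimal-risk maneuver

def escalation_state(seconds_inattentive: float) -> str:
    """Map time without an attention signal to an escalation level."""
    if seconds_inattentive >= FALLBACK_AFTER_S:
        return "minimal_risk_maneuver"
    if seconds_inattentive >= ALERT_AFTER_S:
        return "audible_alert"
    if seconds_inattentive >= WARN_AFTER_S:
        return "visual_reminder"
    return "nominal"

audit_log = []
for t in (2.0, 7.5, 12.0, 16.0):   # simulated inattention durations in seconds
    state = escalation_state(t)
    audit_log.append({"inattentive_s": t, "state": state, "ts": time.time()})
    print(f"{t:5.1f}s inattentive -> {state}")
```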
Testing practices and metrics amplify the problem. Miles driven in autonomous mode are a useful metric but a blunt one. Disengagement counts are subject to differing definitions and reporting practices, which complicates comparisons across teams. Without standardized, publicly auditable metrics for performance across a diversity of edge cases, the industry will continue to suffer from overconfidence and uneven risk exposure. The California reporting framework has improved transparency, yet it also highlighted the difficulty of extracting clear safety signals from aggregated corporate data.
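The definitional problem is easy to demonstrate: the same raw event log yields very different headline rates depending on which events count as reportable disengagements. The event categories and mileage below are invented for illustration.

```python
# Minimal sketch: one raw event log, two plausible definitions of a reportable
# disengagement, two different headline rates. Event categories and mileage
# are invented for illustration, not drawn from any agency's rulebook.
events = [
    {"type": "planned_handover"},        # routine end-of-route takeover
    {"type": "safety_driver_override"},
    {"type": "system_fault"},
    {"type": "planned_handover"},
    {"type": "safety_driver_override"},
]
autonomous_miles = 10_000  # hypothetical mileage for the same period

def rate(reportable_types: set) -> float:
    count = sum(1 for e in events if e["type"] in reportable_types)
    return autonomous_miles / count if count else float("inf")

# Narrow definition: only safety-relevant events count.
print("narrow:", rate({"safety_driver_override", "system_fault"}))                        # ~3,333 mi/disengagement
# Broad definition: every handover counts, however routine.
print("broad :", rate({"planned_handover", "safety_driver_override", "system_fault"}))    # 2,000 mi/disengagement
```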
What should practitioners, policymakers, and ethicists do in response? First, accept that incremental progress requires honest reporting of failures as well as successes. Second, align deployment decisions with validated operational design domains and require demonstrable robustness to distributional shifts before removing human controls. Third, invest in richer testbeds that intentionally expose systems to corner cases and adversarial conditions rather than only to nominal scenarios. Fourth, incorporate human factors engineering into system design so that supervisory roles are realistic, engaging, and auditable.
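As an illustration of the third point, a test harness can deliberately oversample corner cases relative to their on-road frequency and report pass rates per scenario class, so that edge-case failures stay visible instead of being averaged away. The scenario names, weights, and run_scenario interface below are hypothetical.

```python
# Minimal sketch of a test harness that oversamples corner cases instead of
# drawing only nominal scenarios. Scenario names, weights, and the
# run_scenario interface are hypothetical.
import random

SCENARIOS = {
    # Weights intentionally exceed real-world corner-case frequencies.
    "nominal_driving":          0.20,
    "occluded_pedestrian":      0.20,
    "sensor_dropout":           0.20,
    "adverse_weather":          0.20,
    "adversarial_perturbation": 0.20,
}

def sample_scenarios(n: int) -> list[str]:
    names, weights = zip(*SCENARIOS.items())
    return random.choices(names, weights=weights, k=n)

def evaluate(system, n: int = 1_000) -> dict:
    """Report pass rates per scenario class so edge-case failures stay visible."""
    results = {name: {"pass": 0, "total": 0} for name in SCENARIOS}
    for name in sample_scenarios(n):
        results[name]["total"] += 1
        if system.run_scenario(name):   # hypothetical interface: True on success
            results[name]["pass"] += 1
    return {k: v["pass"] / v["total"] for k, v in results.items() if v["total"]}
```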
Technically minded readers will object that we already have partial remedies: robust sensor suites, redundancy, domain randomization, and online adaptation. Those are important. Yet none obviates the deeper truth that machine-learned autonomy is not a magic bullet. It is a set of tools with clear strengths and deep weaknesses. Treating autonomy as a replacement for careful systems engineering and social governance invites harm. Treating autonomy as a collaborator that requires new institutions for testing, reporting, and accountability gives us a chance to harvest its benefits while containing its failures.
If we are to place machines into harm’s way or reduce human presence on future battlefields, then the military and civilian sectors share a duty to get this right. That duty is technical, procedural, and moral. The history of real-world autonomy tests suggests a single lesson repeated in many forms: progress will be uneven, and failure modes will surprise us until we design tests, metrics, and social arrangements that respect the limits of current AI rather than pretend they do not exist.